Abstract

The  453d  Electronic  Warfare  Squadron  supports  on-going  military  operations  by 
providing  battlefield  commanders  with  aircraft  ingress  and  egress  routes  that  minimize 
the  risk  of  shoulder  or  ground-fired  missile  attacks  on  our  aircraft.  To  determine  these 
routes,  the  453d  simulates  engagements  between  ground-to-air  missiles  and  allied  aircraft 
to  determine  the  probability  of  a  successful  attack.  The  simulations  are  computationally 
expensive, often requiring two hours for a single 10-second missile engagement.
Hundreds  of  simulations  are  needed  to  perform  a  complete  risk  assessment  which 
includes  evaluating  the  effectiveness  of  countermeasures  such  as  flares,  chaff,  jammers, 
and  missile  warning  systems.  Thus,  the  need  for  faster  simulations  is  acute. 

This  research  speeds  up  these  mission  critical  simulations  by  using  inexpensive 
commodity  PC  graphics  cards  to  perform  intensive  image  processing  computations  used 
to  simulate  a  heat  seeking  missile’s  tracking  system.  The  innovative  techniques 
developed in this research reduce execution time by 33% and incorporate a
user-selectable fidelity feature to perform high-fidelity simulations when required.
Furthermore,  these  image  processing  computations  use  only  5%  of  the  available 
computational  capacity  of  the  graphics  cards,  providing  a  ready  source  of  additional 
computational  power  for  future  simulation  enhancements. 

Analysts  can  now  meet  shorter  suspenses  with  more  accurate  products,  ultimately 
enhancing  the  safety  of  Air  Force  pilots  and  their  weapon  systems.  With  ongoing 
operations  in  Iraq  and  Afghanistan,  and  a  growing  threat  at  home  and  abroad  posed  by 
the proliferation of man-portable missiles, the speed of these simulations plays an
important role in protecting forces and saving lives.
ACCELERATING MISSILE THREAT ENGAGEMENT SIMULATIONS USING
PERSONAL COMPUTER GRAPHICS CARDS

I.  Introduction 

Motivation  for  this  research  comes  from  two  fronts.  First,  a  review  of  the  literature 
reveals  that  commodity  graphics  accelerator  cards,  found  in  almost  every  personal 
computer  on  the  market  today,  have  reached  a  level  of  power  and  programmability  that 
enables  them  to  be  used  as  high  performance  stream  computers,  adaptable  to  a  variety  of 
general  purpose  computing  tasks  [MoA03][Mor03][RuS01][KrW03][LaM01][LWK03]. 
Further,  these  devices,  commonly  referred  to  as  Graphics  Processing  Units  (GPU),  can 
actually outperform the modern CPU in a range of computationally intensive applications
[TrS01][KrW03][BFH04][LWK03].  The  GPU  therefore  represents  a  powerful,  untapped 
resource  with  the  potential  to  provide  a  sizeable  performance  boost  for  little  to  no  extra 
cost.¹

The  second  motivation  for  this  research  stems  from  a  mission  requirement.  The  453d 
Electronic  Warfare  Squadron,  part  of  the  Air  Force  Information  Warfare  Center 
(AFIWC),  is  exploring  ways  to  speed  up  the  execution  of  computer-based  simulations, 
specifically  those  used  to  evaluate  the  effectiveness  of  the  countermeasures,  such  as 
flares,  chaff,  jammers,  and  missile  warning  systems,  used  by  USAF  aircraft  against 
missile  threats.  AFIWC  uses  the  Joint  Modeling  and  Simulation  System  (JMASS)  Threat 
Engagement Analysis Model (TEAM) software to run simulated engagements between
missile  threats  and  friendly  aircraft,  under  various  maneuver  and  environmental 
conditions,  evaluating  scenarios  for  the  warfighter  that  would  be  cost  prohibitive  or 
logistically impossible to obtain otherwise. The results of AFIWC threat analyses
determine the adequacy of existing countermeasures, tactics, techniques and procedures,
and are used in the development of new ones. With ongoing operations in Iraq and
Afghanistan, and a growing threat abroad posed by the proliferation of man-portable
missiles, AFIWC simulations play an important role in protecting forces and saving lives.

¹ Mainstream graphics cards range in price from about $60 or less to about $500.

Unfortunately,  JMASS  simulations  take  a  long  time  to  execute:  up  to  two  hours  to 
simulate  a  10-second  engagement.  This  is  a  problem  for  several  reasons.  To  provide  the 
best  possible  analysis,  hundreds  of  simulations  must  often  be  done  to  cover  the  many 
variations  of  position,  maneuver,  and  environment  for  a  given  scenario.  The  quality  of 
analysis  is  therefore  constrained  both  by  the  amount  of  time  available  for  conducting 
simulations  and  the  JMASS  execution  time.  When  operating  under  a  short  suspense, 
quality  can  suffer.  Further,  the  missiles  are  becoming  smarter,  able  to  identify  target 
features  at  ever  increasing  levels  of  detail.  Correspondingly,  there  is  an  increasing  need 
for higher-fidelity simulations, which of course require more time to execute due to the
increased  amount  of  computation  required.  JMASS  is  generally  run  on  high-end  personal 
computers  and  multiprocessor  workstations.  Though  the  speed  of  these  machines 
continues  to  increase,  it  has  not  been  sufficient  to  match  the  demand  for  faster  and  more 
detailed  simulations. 

To  address  these  concerns,  AFIWC  initiated  a  collaborative  effort  with  the  Air  Force 
Research  Laboratory,  Naval  Sea  Systems  Command  and  the  Air  Force  Institute  of 
Technology  to  develop  a  hardware-based  means  for  accelerating  the  image  processing 
calculations  thought  to  present  the  greatest  computational  load  during  JMASS 
simulations.  Since  this  requirement  emphasizes  performance  in  the  processing  of 
graphical  information,  it  seemed  worthwhile  to  apply  today’s  flexible  and  powerful  GPUs 
toward providing a low-cost, potentially high-payoff solution. The remainder of this
chapter  provides  an  overview  of  the  JMASS  simulation  process  and  a  detailed 
characterization  of  the  problem  posed  by  AFIWC. 

JMASS  Background  and  Characterization  of  AFIWC  Requirement 

JMASS  simulations  execute,  as  do  most  simulations,  in  discrete  steps  that  model  the 
state  of  the  system  at  regular  intervals  of  simulated  time.  This  interval  is  called  the  model 
time  step,  and  can  be  thought  of  as  either  the  simulation’s  time  resolution,  or  the  rate  at 
which the simulation “samples” the simulated world [Air04]. In JMASS, the model time
step  is  usually  set  to  update  the  simulated  environment  in  1/250  second  (equivalently, 
0.004  second)  intervals.  During  each  time  step,  the  JMASS  simulator  generates  a  digital 
image  to  simulate  the  missile’s  current  infrared  (IR)  field  of  view,  essentially  mimicking 
the  way  the  world  would  appear  to  a  missile  during  flight.  The  image  is  submitted  to  a 
mathematical  model  representative  of  a  particular  missile’s  electro-optical  sensor  (a.k.a. 
seeker)  and  control  system  path,  and  the  missile’s  response  (i.e.,  maneuver  or  change  in 
direction)  is  fed  back  to  the  JMASS  simulator  for  generating  the  next  scene.  This 
iterative  and  interactive  process  of  scene  generation  and  missile  optics  response  occurs 
about  2,500  times  to  simulate  a  10-second  engagement. 

Of  specific  interest  are  the  image  processing  calculations  for  modeling  the  optical  path 
of  the  seeker,  since  this  is  where  JMASS  appears  to  spend  most  of  its  runtime.  A  typical 
infrared  seeker  is  positioned,  not  surprisingly,  at  the  front  of  the  missile  and  consists  of 
an  IR-transparent  dome  followed  by  a  set  of  optics  not  unlike  a  telescope.  The  optics 
focus  incoming  light,  presumably  emanating  from  the  missile’s  prospective  target, 
through  a  rapidly  spinning,  partly  transparent  disc,  called  a  reticle,  which  modulates  the 
light and passes it to an IR detector. The reticle is specially designed to modulate the
light  in  such  a  way  that  the  position  of  the  target  relative  to  the  center  of  the  missile’s 
field  of  view  can  be  determined  from  the  modulated  signal.  The  missile’s  control  system 
uses  this  signal  to  guide  the  missile  to  the  target  [MaV83]. 

JMASS  simulates  the  seeker  system  described  above  by  modeling  the  interaction 
between  the  spinning  reticle  and  the  incoming  IR  scene.  It  accepts  IR  scene  images  as 
input,  and  produces  a  reticle-modulated  signal  as  output.  The  calculations  associated 
with  this  step  in  the  simulation  process,  described  in  the  following  paragraphs,  are  the 
subject  of  AFIWC’s  hardware  acceleration  initiative,  and  likewise,  the  candidate  for 
potential  GPU  acceleration. 

Prior  to  beginning  a  JMASS  simulation,  a  data  structure  is  initialized  to  model  the 
reticle.  The  reticle  image  is  represented  as  a  static  480  x  480  element  array,  with  each 
element  (or  pixel)  containing  a  floating  point  number  whose  value  is  between  zero  and 
one,  indicating  the  degree  to  which  each  point  on  the  reticle  permits  light  to  pass  through 
it.  The  reticle  image  for  the  chosen  missile  is  loaded  into  CPU  memory  from  a  data  file 
prior  to  the  start  of  the  simulation. 

For  each  model  time  step,  JMASS  determines  an  appropriate  angular  displacement  for 
the  reticle  (recall  the  reticle  is  spinning),  then  creates  a  rotated  copy  by  performing  a 
linear  coordinate  transformation  on  the  original.  The  rotated  reticle  image  may  be  resized 
to  match  the  resolution  of  the  IR  scene  produced  by  the  simulator,  then  interpolated  by 
one  of  four  selectable  algorithms  to  smooth  any  artifacts  that  may  have  been  caused  by 
the  rotation  and  resizing  transformations.  JMASS  performs  an  element-by-element 
multiplication  of  the  rotated,  smoothed  reticle  image  with  the  current  IR  scene  to  produce 
a new image, one that represents the IR scene filtered (or attenuated) by the reticle.
Finally,  the  values  of  all  the  pixels  of  this  resultant  image  are  summed  to  produce  a  single 
radiance  value.  This  value  represents  the  light  intensity  that  would  be  incident  on  the 
missile’s  IR  detector  given  the  input  scene  and  reticle  orientation  at  a  particular  instant  in 
simulated  time. 
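
The rotate, multiply, and sum steps described above can be sketched in a few lines. This is a minimal illustration, not the JMASS implementation: the function names are hypothetical, the images are tiny, and nearest-neighbor resampling stands in for the resizing and selectable interpolation algorithms that JMASS applies.

```python
import math

def rotate_nearest(image, angle_rad):
    """Return a rotated copy of a square image (nearest-neighbor resampling)."""
    n = len(image)
    c = (n - 1) / 2.0  # rotation center in pixel coordinates
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Map each output pixel back to its source location.
            dx, dy = x - c, y - c
            sx = round(c + cos_a * dx + sin_a * dy)
            sy = round(c - sin_a * dx + cos_a * dy)
            if 0 <= sx < n and 0 <= sy < n:
                out[y][x] = image[sy][sx]
    return out

def detector_sample(reticle, scene, angle_rad):
    """Multiply the rotated reticle by the scene, pixel by pixel, and sum."""
    rotated = rotate_nearest(reticle, angle_rad)
    return sum(r * s
               for r_row, s_row in zip(rotated, scene)
               for r, s in zip(r_row, s_row))

# Hypothetical 4x4 data: the left half of the reticle is fully transparent.
reticle = [[1.0 if x < 2 else 0.0 for x in range(4)] for _ in range(4)]
scene = [[0.0] * 4 for _ in range(4)]
scene[1][0] = 5.0  # a single hot pixel on the left side of the scene

print(detector_sample(reticle, scene, 0.0))      # hot pixel passes through
print(detector_sample(reticle, scene, math.pi))  # hot pixel is blocked
```

As the reticle turns, the same scene yields different radiance values, which is the modulation the missile's control system relies on.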

Recall  the  field  of  view  is  updated  (i.e.,  a  new  IR  scene  is  produced  by  the  JMASS 
simulator)  250  times  per  simulated  second.  However,  because  the  spinning  reticle  results 
in  a  modulated  detector  signal  with  frequency  on  the  order  of  1-2  kHz,  sampling  theory 
requires  a  minimum  sampling  rate  of  4,000  samples  per  simulated  second.  AFIWC  has 
specified  a  higher,  10  kHz  sampling  rate  to  protect  against  aliasing.  Since  the  JMASS 
simulator’s  250  Hz  simulation  step  falls  well  short  of  this,  each  scene  must  be  multiplied 
by  forty  reticle  images  (each  requiring  a  different  amount  of  rotation,  followed  by 
resizing  and  interpolation),  and  forty  sums  produced,  to  provide  the  10,000  samples  per 
simulated  second  to  replicate  the  detector  signal.  Figure  1-1  below  presents  a  simplified 
view  of  this  process. 


[Figure 1-1: JMASS image processing for missile flight simulation. The reticle image and the IR scene undergo an element-by-element multiply, performed 40 times per simulation step (on 40 differently rotated reticles), or 10,000 times per simulated second.]

Figure 1-1. How JMASS models missile optics to produce simulated IR detector signal.
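
The sampling figures quoted above follow from simple arithmetic, reproduced here as a sanity check. The constant names are illustrative only; the values are those stated in the text.

```python
MODEL_STEP_HZ = 250      # IR scene updates per simulated second
SIGNAL_MAX_HZ = 2_000    # upper end of the 1-2 kHz modulated detector signal
SAMPLE_RATE_HZ = 10_000  # sampling rate specified by AFIWC
ENGAGEMENT_S = 10        # length of a typical simulated engagement

# Sampling theory: sample at no less than twice the highest signal frequency.
nyquist_min_hz = 2 * SIGNAL_MAX_HZ

# Each scene must be reused for several reticle orientations to reach 10 kHz.
reticles_per_step = SAMPLE_RATE_HZ // MODEL_STEP_HZ

model_steps = MODEL_STEP_HZ * ENGAGEMENT_S
detector_samples = SAMPLE_RATE_HZ * ENGAGEMENT_S

print(nyquist_min_hz, reticles_per_step, model_steps, detector_samples)
```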
For each model time step, JMASS performs the rotate, interpolate, multiply, and add
operations described above as a series of separate O(N²) computations, where N is the
width (and height, for square images), in pixels, of the images being operated on.
Depending on the size of the images used, this could require on the order of 75 million
double precision floating point calculations per model time step, or 19 billion calculations
per simulated second.² The JMASS software is written in C++, for the most part, and
executes on a Windows or Unix-based platform. To provide a concrete example, it takes
about two hours for JMASS, running on a 2.8 GHz Pentium 4, using 512 × 512 images,
to simulate a 10-second engagement.
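
The operation counts above can be reproduced from the per-pixel figures given in the accompanying footnote (16 operations for rotation, 10 for bilinear interpolation, about 2 for the multiply-add, on a 256x256 image). The sketch below is only that arithmetic; the function and variable names are illustrative.

```python
# Per-pixel floating point operation counts, taken from the footnote.
FLOPS_PER_PIXEL = {"rotation": 16, "interpolation": 10, "multiply_add": 2}

def flops_per_step(width, reticles_per_step=40):
    """Floating point operations per model time step for width x width images."""
    per_pixel = sum(FLOPS_PER_PIXEL.values())
    return width * width * per_pixel * reticles_per_step

per_step = flops_per_step(256)  # about 73 million, i.e. on the order of 75 million
per_second = per_step * 250     # about 18 billion, i.e. roughly 19 billion

print(f"{per_step:,} per step, {per_second:,} per simulated second")
```

Because the cost grows with the square of the image width, doubling the width to 512 pixels quadruples the per-step count.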

The  optics  calculations  described  above  model  the  behavior  of  a  spin  scan  seeker. 
Generally, missiles employ one of two types of seekers, spin scan or conical scan.
JMASS can simulate both types. Conical scan is similar to spin scan except that the IR
scene  is  larger  (generally  twice  the  height  and  width  of  the  reticle  image),  and  prior  to 
performing  the  reticle-scene  multiply-add  operation,  the  reticle  image  is  shifted  with 
respect  to  the  scene  by  a  set  of  specified  x-y  offset  values,  in  pixels.  The  offset  can  be 
different  for  each  of  the  forty  reticle  images  used  during  a  model  time  step.  To  be  of 
greatest  use  to  AFIWC,  a  GPU  implementation  should  support  both  spin  scan  and  conical 
scan  seekers. 
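
The conical scan variant differs from spin scan only in where the reticle is laid over the larger scene. A minimal sketch of the offset multiply-add follows; the function name and the tiny test data are hypothetical, and reticle rotation and interpolation are omitted for brevity.

```python
def conical_scan_sample(reticle, scene, x_off, y_off):
    """Sum reticle * scene with the reticle shifted by (x_off, y_off) pixels."""
    n = len(reticle)
    if not (0 <= x_off <= len(scene[0]) - n and 0 <= y_off <= len(scene) - n):
        raise ValueError("offset places the reticle outside the scene")
    total = 0.0
    for y in range(n):
        for x in range(n):
            total += reticle[y][x] * scene[y + y_off][x + x_off]
    return total

# Hypothetical data: the scene is twice the height and width of the reticle.
reticle = [[1.0, 1.0], [1.0, 1.0]]
scene = [[float(4 * y + x) for x in range(4)] for y in range(4)]

# A different offset may be supplied for each of the forty reticle images.
for offset in [(0, 0), (1, 1), (2, 2)]:
    print(offset, conical_scan_sample(reticle, scene, *offset))
```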

In addition to the GPU-based effort that is the subject of this research, AFIWC is
investigating Field Programmable Gate Array (FPGA) technology to accelerate both
software-based (like JMASS) and real-time, so-called “hardware-in-the-loop”
simulations, which interface with real missile hardware. Since software-based
simulations are not performed in real-time, they stand to benefit from any amount of
speedup that can be provided. However, this is not the case for real-time simulations,
which must either sustain a throughput of 19 GFLOPS or fail. Whether or not this kind
of performance is within the capabilities of FPGAs remains to be determined, and is
beyond the scope of this research. However, as will be shown later in this thesis, such
performance is almost certainly beyond the current capabilities of graphics cards.
Therefore, any performance gains realized through a GPU will likely only benefit
software-based simulations.

² Assuming a 256x256 size image and bilinear interpolation. This accounts for floating point addition,
multiplication, and sin and cos operations, but does not include instructions for performing loops, lookups
or array index calculations. Interpolation requires 10 floating point operations per image pixel, rotation
requires 16, and the multiply-add about 2.

As  indicated  throughout  this  section,  the  AFIWC  hardware  acceleration  initiative  is 
predicated  on  the  assumption  that  image  processing  calculations  are  the  source  of  the 
performance  bottleneck,  and  should  therefore  be  the  prime  target  for  optimization  efforts. 
Indeed,  an  analysis  of  the  JMASS  C++  code  supports  this  assumption,  since  the  bulk  of 
the calculations reside in the O(N²) code structure which performs the image processing
calculations [Joi04]. However, if this is not the case, optimizing the image processing
calculations  may  not  be  enough.  According  to  Amdahl’s  Law  [HeP96],  if  other 
bottlenecks  exist,  they  could  reduce  the  effectiveness  of  even  the  most  spectacular  image 
processing  performance  gains  provided  by  a  GPU  or  FPGA.  This  does  not  diminish  the 
importance  of  these  hardware  acceleration  efforts.  However,  it  suggests  adopting  a 
system-wide  approach  in  addressing  the  JMASS  performance  issue. 
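
Amdahl's Law makes this concern concrete. In the sketch below, the fractions and speedup factors are hypothetical values chosen for illustration, not measurements of JMASS.

```python
def amdahl_speedup(fraction_accelerated, local_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    serial = 1.0 - fraction_accelerated
    return 1.0 / (serial + fraction_accelerated / local_speedup)

# Hypothetical cases: image processing is 50% or 90% of total runtime
# and the accelerator makes that portion 10 times faster.
for fraction in (0.5, 0.9):
    print(fraction, round(amdahl_speedup(fraction, 10.0), 2))

# Even an effectively unbounded speedup of half the runtime cannot do
# better than a factor of two overall.
print(round(amdahl_speedup(0.5, 1e9), 3))
```

The smaller the share of runtime spent in image processing, the less any accelerator can help, which is why a system-wide view of the bottleneck matters.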
II. Literature Review


The  General  Purpose  GPU 

Almost  every  personal  computer  available  today  comes  equipped  with  dedicated 
graphics acceleration hardware, either built-in to the motherboard or provided as an
add-in circuit card. Though graphics co-processors, or Graphics Processing Units (GPU) as
the  industry  refers  to  them,  have  become  commodity  items  in  personal  computers,  what  is 
not  generally  recognized  is  these  devices  have  become  formidable  computing  machines 
in  their  own  right,  exceeding  the  modern  desktop  CPU  in  terms  of  raw  computational 
power. 

For example, Macedonia [Mac03] reported that the 20 GFLOPS peak performance of the
Nvidia GeForce FX 5900, a mainstream GPU in 2003, is equivalent to that of a 10 GHz Intel
Pentium. It is interesting that GPUs achieve such performance running at much slower
clock  rates  than  CPUs,  the  result  of  a  highly  parallelized  architecture.  Typical  GPU  clock 
rates  range  from  233  to  400  MHz,  while  current  CPU  clock  rates  are  on  the  order  of  a 
few  GHz.  Current  models  of  GPU  contain  220  million  transistors,  the  bulk  of  which  are 
dedicated  to  parallel  processing  of  input  streams,  whereas  Intel’s  Xeon  CPU  has  only  108 
million transistors, 60 percent of which are devoted to cache memory [Mac03]. Equally
impressive, the growth of GPU performance has exceeded Moore’s Law [MoA03],
increasing at a rate of 2.8 times per year since 1993, and is expected to continue at this
rate for another five years, perhaps achieving tera-FLOP performance by 2005 [Mac03].

While  the  main,  market-driven  purpose  of  the  GPU  continues  to  be  providing 
increased  resolution,  dynamic  range,  frame  rates  and  programmability  to  keep  pace  with 
the  demand  for  ever  more  realistic  games  and  multimedia  applications,  these  same 
advances have resulted in an important, perhaps revolutionary side benefit: the
architecture of the modern programmable GPU has become so flexible it is possible to
exploit its inherent computational power to perform many general-purpose computing tasks
faster than they can be done on a CPU [TrS01][KrW03][BFH04][LWK03]. These
developments  have  not  been  lost  on  a  number  of  researchers  who  have,  especially  over 
the  past  four  years,  successfully  used  a  GPU  to  accelerate  a  myriad  of  general-purpose 
computing  tasks.  Just  a  few  of  the  diverse  examples  include  linear  algebra 
[Mor03][KrW03][LaM01], finite element analysis [RuS01], lattice Boltzmann
computation  [LWK03]  and  Fast  Fourier  Transform  calculations  [MoA03]. 

GPU  Architecture 

The  ability  to  use  the  GPU  for  general  purpose  computing  results  from  its  evolution 
over  the  past  decade  from  a  fixed-function  pipeline  architecture,  to  a  fully  programmable 
Single  Instruction  Multiple  Data  (SIMD)  parallel,  or  streaming,  processor 
[MoA03][BFH04]. This section describes the GPU architecture.

A stream is simply a collection of data operated on in parallel [BFH04]. The GPU is
optimized  for  rendering  images,  a  task  that  involves  performing  fast,  parallel  operations 
on  large  streams  of  data.  As  such,  most  GPUs  include  their  own  high-bandwidth  memory 
subsystems for storing and manipulating graphical data. For example, the current
top-of-the-line mainstream GPU from nVidia, the GeForce 6800, has 256 MB of memory
accessible via a 256-bit bus with an advertised bandwidth of 35.2 GB per second [Nvi04].
In late 2004, 3DLabs is expected to make available its high-end Wildcat Realizm 800
GPU with 640 MB memory, 512-bit bus, and an advertised memory bandwidth of 64
GB/second [Pci04a]. By way of comparison, the Intel 875 chipset that supports the
Pentium IV only provides 6.4 GB/second CPU-to-main memory bandwidth [Int04a].

In  general,  the  GPU  processes  two  kinds  of  data:  vertices  and  textures  [Mor03]. 
Vertices  represent  points  in  space  and  are  used  to  build  graphical  primitives,  such  as 
polygons,  which  can  be  assembled  to  form  complex  3-dimensional  objects.  Vertices 
possess  attributes  such  as  color,  position  vector  and  texture  coordinates,  which  are  stored 
in registers and can be operated on by various functions [THO02]. Textures, on the other
hand,  are  1,  2,  or  3-dimensional  images  applied  to  polygons,  much  like  wallpaper  or 
shrink-wrap,  to  impart  the  look  of  a  realistic  surface.  Textures  are  stored  in  GPU 
memory  as  arrays  of  pixels,  and  each  texture  pixel  is  represented  by  a  four-component 
vector, holding the intensity values for red, green, blue and alpha (RGBA) color
channels. 

To  render  an  image,  a  user  application  must  provide  the  GPU  a  set  of  vertices  and/or 
textures.  Some  or  all  of  the  data  may  already  be  in  GPU  memory,  left  over  from  previous 
operations;  otherwise,  data  must  be  uploaded  to  the  GPU.  The  GPU  can  retrieve  large 
blocks  of  data  from  CPU  main  memory  via  DMA.  To  prevent  fast  GPUs  from  becoming 
data-starved, modern PC buses include a dedicated interface for the GPU, the Advanced
Graphics  Port  (AGP),  which  provides  a  2  GB  per  second  path  between  the  GPU  and  main 
memory.  This  figure  will  increase  to  4  GB  per  second  when  computers  based  on  the 
next-generation  PCI  Express  bus  standard  become  available  within  the  next  year 
[Int04b][Pci04b]. Unfortunately, DMA hardware is not provided for transferring data
quickly in the opposite direction. Such a capability is important since any significant use
of the GPU for general-purpose computing requires transferring GPU-computed results
into CPU main memory for further processing [THO02].

Once  the  appropriate  data  has  been  loaded  into  the  GPU,  it  proceeds  through  the  GPU 
pipeline  in  the  following  general  sequence.  First,  the  GPU  generates  geometry  using  the 
vertex  information  provided  by  the  user  application.  The  GPU  transforms  the  geometry 
into  a  chosen  coordinate  frame,  clips  it  to  fit  within  a  specified  viewport,  or  drawing 
rectangle, if need be, and applies lighting and color calculations [THO02]. Next, the
GPU applies textures to the geometry, and passes everything to the rasterizer, which
converts the vector-based geometry data into a pixel-based representation for rendering
[THO02]. These pixels, as they exist prior to rendering, are referred to as fragments.
Finally, the pixels are rendered into a section of GPU memory, called the frame buffer
[LaM01], for display on the screen.

The  functions  of  the  GPU  are  accessible  via  an  Application  Programmer  Interface 
(API)  such  as  OpenGL,  created  by  Silicon  Graphics,  or  Microsoft’s  DirectX.  These 
provide  standardized  interfaces,  data  types  and  functions  to  access  the  features  of  many 
GPUs. The extent to which the API feature set is supported or extended depends on the GPU
manufacturer. 

What  remains  to  be  explained  is  how  the  GPU  architecture  can  be  applied  to 
solving general-purpose computing problems. The following from [TrS01] addresses this
and  nicely  captures  the  motivation  behind  using  the  GPU  for  general-purpose 
computation: 

Modern raster graphics implementations typically have a number of buffers with a depth of 32 bits per
pixel  or  more.  In  the  most  general  setting,  each  pixel  can  be  considered  to  be  a  data  element  upon  which 
the  graphics  hardware  operates.  This  allows  a  single  graphics  language  instruction  to  operate  on  multiple 
data  as  in  a  SIMD  machine. 
Since the bits associated with each pixel can be allocated to one of four components, a raster image can
be  interpreted  as  a  scalar  or  vector  valued  function  defined  on  a  discrete  rectangular  domain  in  the  xy  plane. 
The  luminance  value  of  a  pixel  can  represent  the  value  of  the  function  while  the  position  of  the  pixel  in  the 
image  represents  the  position  in  the  xy  plane.  Alternatively,  an  RGB  or  RGBA  image  can  represent  a  three 
or  four  dimensional  vector  field  defined  over  a  subset  of  the  plane.  The  beauty  of  this  kind  of  interpretation 
is  that  operations  on  an  image  are  highly  parallelized  and  calculations  on  entire  functions  or  vector  fields 
can  be  performed  very  quickly  in  graphics  hardware. 

Further,  typical  scientific  computing  applications  perform  at  about  1%  of  peak  (CPU) 
processor  performance.  Recall  a  CPU  cache  hierarchy  excels  when  it  performs  repeated 
operations  on  a  block  of  data,  but  suffers  when  the  block  of  data  exceeds  the  cache  size. 
The  GPU,  however,  generally  has  much  more  memory  capacity  than  a  CPU  cache,  and  is 
capable of performing operations in parallel [RuS01].

Using  the  GPU  Fixed-function  Pipeline 

An  early  attempt  to  use  the  GPU  for  general-purpose  numerical  computation  used  the 
fixed-function pipeline of the GPU to perform matrix multiplication. 2D textures stored
the  matrices,  with  matrix  element  values  stored  as  individual  pixels  within  the  textures. 
For  reasons  to  be  discussed  later,  the  technique  of  using  textures  versus  vertices  to 
represent  data  in  the  GPU  is  widespread  in  the  literature.  The  matrix  multiplication 
algorithm  referred  to  above  exploits  the  spatial  parallelism  of  GPU  computation, 
performing a series of element-by-element multiplications of texture pairs, with
element-by-element additions performed in between to accumulate results [LaM01].

To implement the algorithm, a pair of order-n square matrix multiplicands A and B are
preprocessed using the CPU to create two new sets of textures, A’ and B’, each
containing n textures of size n × n, such that the i-th texture in A’ contains the i-th column from
A copied across its columns, and the i-th texture of B’ contains the i-th row from B copied
across its rows. Figure 2-1 shows an example using 2x2 matrices. As if dealing
corresponding cards from two decks, the i-th textures from A’ and B’ are transferred to
the  GPU  in  pairs  and  multiplied  element-by-element  in  what  is  called  a  multi-texturing 
operation.  Multi-texturing  takes  two  textures  as  operands  and  combines  them  in  one  of 
several  user-selectable  ways  to  produce  an  output  texture.  In  this  example,  each  pair  of 
textures  is  multiplied  using  the  “modulate”  multi-texturing  mode,  applied  to  a  single 
quadrilateral  fragment  in  the  rasterization  stage  of  the  GPU  pipeline,  then  rendered  to  the 
frame  buffer.  To  accumulate  results,  the  output  of  each  texture  multiply  is  rendered  to 
the  frame  buffer  using  the  “sum”  texture  blending  mode.  In  this  mode,  rendering  causes 
the  contents  of  the  rasterizer  to  be  added,  pixel-by-pixel,  with  the  existing  contents  of  the 
frame buffer, thereby allowing the accumulation of results in the frame buffer [LaM01].
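
The texture-based multiplication described above can be emulated on the CPU to show why it produces the ordinary matrix product. This is an illustrative sketch only: nested lists stand in for textures and the frame buffer, the function names are hypothetical, and no graphics API is involved.

```python
def make_texture_sets(a, b):
    """Build the texture sets A' and B' from order-n square matrices A and B."""
    n = len(a)
    # i-th texture of A': column i of A copied across every column.
    a_tex = [[[a[r][i] for _ in range(n)] for r in range(n)] for i in range(n)]
    # i-th texture of B': row i of B copied across every row.
    b_tex = [[[b[i][c] for c in range(n)] for _ in range(n)] for i in range(n)]
    return a_tex, b_tex

def multiply_via_textures(a, b):
    """Emulate the modulate-then-sum rendering passes."""
    n = len(a)
    a_tex, b_tex = make_texture_sets(a, b)
    frame_buffer = [[0] * n for _ in range(n)]
    for tex_a, tex_b in zip(a_tex, b_tex):  # one rendering pass per texture pair
        for r in range(n):
            for c in range(n):
                # "modulate" multiplies the pair pixel by pixel;
                # "sum" blending adds the product to the frame buffer.
                frame_buffer[r][c] += tex_a[r][c] * tex_b[r][c]
    return frame_buffer

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(multiply_via_textures(a, b))
```

Pixel (r, c) of pass i holds A[r][i] * B[i][c], so accumulating over all n passes yields the usual sum over i. On the GPU, the two inner loops are what the hardware performs in parallel across the pixels of each pass.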

Using this technique, two order-1024 square matrices were multiplied in 0.546 seconds
on the nVidia GeForce3 [LaM01]. This time includes converting matrices to texture
maps, transferring the textures to GPU memory, performing the calculations, copying the
frame  buffer  back  to  CPU  main  memory,  and  converting  back  to  matrix  format.  GPU 
performance  is  compared  to  a  CPU-based  benchmark,  Automatically  Tuned  Linear 
Algebra  Software  (ATLAS)  running  on  a  Pentium  IV.  However,  direct  comparison  is  not 
possible  because  then-current  GPUs  were  only  capable  of  8-bit  fixed  point  arithmetic, 
and  ATLAS  performed  its  calculations  in  32-bit  floating  point.  To  acknowledge  this 
difference,  GPU  performance  is  stated  in  terms  of  byte  operations  per  second  (BOPS), 
and  compared  with  ATLAS’s  FLOPS. 

For the order-1024 matrix multiply, the GPU achieved 4.4 GBOPS and ATLAS
yielded 4.0 GFLOPS. Though no execution time metric is provided for ATLAS in
[LaM01], ATLAS running on a Pentium IV can multiply two order-1000 matrices in
about 0.5 seconds [Mor03], which, precision issues aside, is comparable to the 0.546-second
GPU time achieved in [LaM01].

(b) Matrix A columns and matrix B rows copied into texture sets A’ and B’. Corresponding textures
multiplied element-by-element using GPU multi-texturing. Final result is computed in GPU by adding
results in texture blending operation.

Figure 2-1. A technique for multiplying matrices using GPU fixed-function pipeline and textures [LaM01].

For large operations, such as multiplying twenty order-1024 matrices, the time spent
transferring  data  to  and  from  the  GPU  is  negligible  compared  to  the  time  spent 
performing  multiplication  and  accumulation  calculations.  Further,  calculation  time  is 
dominated  by  memory  accesses  within  the  GPU  because  the  GPU  architecture  requires 
frame  buffer  memory  accesses  for  both  accumulation  operations  and  for  copying  results 
from the frame buffer back into a texture [LaM01].

Though the results of [LaM01] are not entirely compelling from a performance or
practical standpoint (recall the GPU’s 8-bit limitation), the work represents a starting point for
discussion  because  its  techniques,  observations  and  recommendations  are  recurring 
themes  in  subsequent  research. 

First,  to  be  useful  in  most  scientific  or  engineering  computing  applications,  the  GPU 
should be capable of handling at least 32-bit floating point numbers [LaM01]. This
limitation has in fact been overcome by recent generations of GPU, which now support
32-bit processing throughout the entire pipeline [MoA03][KrW03][Nvi04].

Second, accumulating results between rendering passes requires multiple memory
accesses within the GPU, whereas a CPU can store intermediate results in fast registers.
Future GPU architectures should therefore include persistent registers for this purpose
[LaM01]. Unfortunately, current GPU hardware still does not provide this capability
[BFH04]. Further, though the memory bandwidth of current GPUs is almost five times
that of GPUs from three years ago, the integration of 32-bit floating point support offsets
this bandwidth improvement because more memory accesses per pixel must be made.
This is confirmed in [Mor03], where a GPU with 32-bit functionality multiplied two
order-1000 floating point matrices in just over 0.5 seconds, almost exactly the same time
required by the older-generation GPU operating on 8-bit data.

In addition to the above, there are other ways to increase GPU performance [LaM01]:
up to four numbers may be packed into a single pixel by setting the red, green, blue and
alpha channels to different values; lowering the refresh rate of the monitor could yield a
10% performance improvement; running full screen versus in a window increases
performance; and using ABGR_EXT versus RGBA texture formatting in OpenGL can
improve performance by 40%, since it eliminates time-consuming reformatting within
the GPU.

The technique of texture blending in the fixed-function GPU pipeline has been used to
perform finite element [RuS01] and Lattice Boltzmann [LWK03] computations on GPU
hardware.

The  GPU  Programmable  Pipeline 

The three years following the work of [LaM01] brought significant improvements to
GPU  architecture.  8-bit  fixed  point  has  been  replaced  with  IEEE  32-bit  floating  point 
representation  for  each  of  the  four  color  components  in  each  pixel  [KrW03].  GPU 
internal  memory  bandwidth  increased  by  a  factor  of  four,  and  clock  speed  increased  by  a 
factor  of  two.  But  the  most  significant  advance  with  respect  to  GPU  general  purpose 
computing  is  the  move  toward  a  programmable  architecture.  GPUs  now  contain 
programmable  vertex  and  fragment  processors.  Each  processor  respectively  executes  a 
user-specified  assembly-level  vertex  or  pixel  shader  program  consisting  of  4-way  SIMD 
instructions  that  perform  standard  math  operations,  such  as  3-  and  4-component  dot 
product,  addition  and  multiplication  on  large,  parallel  streams  of  data.  Instructions  for 
texture  fetching  and  other  special-purpose  instructions  are  also  available.  Each  vertex  or 
pixel  fragment  to  be  processed  is  placed  in  a  set  of  read-only  input  registers.  The  shader 
program  is  executed  next  and  the  results  written  to  a  set  of  output  registers.  The  shader 
program  performs  an  implicit  loop,  executing  over  all  the  elements  of  a  stream 
[THO02][BFH04].
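
The execution model described above can be illustrated with a CPU analogy. The following Python sketch is not shader code: `dot4_shader` is a hypothetical stand-in for a user-specified shader program built from a 4-component dot product, and `run_shader` plays the role of the GPU's implicit loop, applying the program once to every element of an input stream and producing one output per element.

```python
from typing import Callable, List, Sequence, Tuple

Vec4 = Tuple[float, float, float, float]


def dot4(a: Vec4, b: Vec4) -> float:
    """4-component dot product, analogous to a single SIMD shader instruction."""
    return sum(x * y for x, y in zip(a, b))


def dot4_shader(element: Vec4, weights: Vec4) -> float:
    """Hypothetical shader program: reads its inputs and returns one output."""
    return dot4(element, weights)


def run_shader(program: Callable[[Vec4, Vec4], float],
               stream: Sequence[Vec4], constants: Vec4) -> List[float]:
    """Stand-in for the implicit loop: run the program once per stream element.

    Each invocation is independent, so nothing carries over between elements.
    """
    return [program(element, constants) for element in stream]


stream = [(1.0, 2.0, 3.0, 4.0), (0.5, 0.5, 0.5, 0.5)]
weights = (1.0, 1.0, 1.0, 1.0)
outputs = run_shader(dot4_shader, stream, weights)
```

Because no invocation depends on another, the hardware is free to process many stream elements in parallel; the list comprehension here only mimics the result, not the parallelism.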

Pixel Shaders versus Vertex Shaders


Pixel  shaders  have  been  used  for  matrix-vector,  vector-vector  and  matrix-matrix 
multiplication,  and  for  2D  Fast  Fourier  Transforms  [Mor03]  [KrW03]  [MoA03]. 

In one approach, matrices are represented as a set of diagonal vectors inside a 2-dimensional
texture to facilitate efficient processing of banded diagonal matrices [KrW03]. A more
straightforward approach breaks column vectors into smaller, four-element sub-columns,
and stores each sub-column as a texture pixel, placing the four individual elements into
the R, G, B and A components of the pixel [Mor03]. Despite differing methods for
packing data into textures, all exploit the 4-tuple parallelism of texture pixels to achieve
four 32-bit calculations per pixel for each SIMD shader instruction. The following passage
justifies using texture fragments rather than vertices as the GPU data format of choice
[Mor03]:

Textured  geometry  is  preferable  because  of  the  more  compact  representation  when  compared  with 
highly  tessellated  geometry  with  vertex  colors.  Also,  unlike  geometry,  textures  can  also  be  output  by  the 
GPU  in  the  form  of  render  target  surfaces.  If  we  store  a  matrix  as  a  texture,  and  then  perform  a  matrix 
operation  such  as  matrix  addition  by  rendering  two  textures  with  additive  blending  into  a  third  render  target 
surface,  the  storage  format  of  the  resulting  matrix  can  be  identical  to  the  input  format.  This  is  a  desirable 
property  because  this  way  we  can  immediately  reuse  the  resulting  texture  as  an  input  to  another  operation 
without  having  to  perform  format  conversion. 
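
The sub-column packing attributed to [Mor03] above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the layout code of any cited work: the function names are hypothetical, Python tuples stand in for RGBA texels, and columns whose length is not a multiple of four are assumed to be zero-padded.

```python
from typing import List, Sequence, Tuple

Texel = Tuple[float, ...]  # four components: (R, G, B, A)


def pack_column(column: Sequence[float]) -> List[Texel]:
    """Split a column into four-element sub-columns, one per texel.

    Assumption: a column whose length is not a multiple of four is zero-padded.
    """
    padded = list(column) + [0.0] * (-len(column) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]


def multiply_texels(a: Sequence[Texel], b: Sequence[Texel]) -> List[Texel]:
    """Element-wise product: four multiplications per texel pair."""
    return [tuple(x * y for x, y in zip(ta, tb)) for ta, tb in zip(a, b)]


def unpack_column(texels: Sequence[Texel], length: int) -> List[float]:
    """Flatten texels back into a column and drop any padding."""
    return [value for texel in texels for value in texel][:length]


col_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
col_b = [2.0] * 6
packed_a = pack_column(col_a)
product = unpack_column(multiply_texels(packed_a, pack_column(col_b)), len(col_a))
```

Since the result of `multiply_texels` has the same layout as its inputs, it can feed another operation directly, which mirrors the format-preservation argument in the quoted passage.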

A notable exception to the above approach develops a framework for general-purpose
GPU computing based on vertex shader programs, as opposed to pixel (texture- or
fragment-based) shaders [THO02]. This choice was primarily
motivated by the state of GPU technology, which at the time offered higher, 16-bit
precision for vertex operations versus only 10 bits for texture operations, and a more
robust, 21-opcode instruction set for vertex shaders. The framework itself is discussed
later; however, there are several weaknesses in using vertex shaders, some of which have
since been addressed by later GPU designs [THO02].

First, the results of vertex programs cannot be stored directly into a GPU memory
buffer without first passing through the GPU pipeline and being converted to pixels.
Then-current GPUs represented pixels with only 8-bit precision. Though internal vertex
computations are carried out with 16-bit precision, significant precision is lost
when the result is retrieved as 8-bit pixels.

Second, program size is limited to 128 instructions, and branching and logical
Boolean operations are not supported. Such restrictions required awkward hand-coded
programming. For example, loops had to be “unrolled”, and the number of loop iterations is
limited by the maximum instruction count. This limitation applies to both vertex and
pixel shaders [THO02].

Lastly, there is no way to share data between multiple vertex program invocations.
Though vertex programs provide at least 96 registers for holding intermediate results
within a program, all registers are zeroed upon program termination [THO02].

As has been discussed previously, precision is no longer a big issue, since 32-bit
floating point is supported by some models of GPU. Also, published specifications for
the nVidia GeForce 6800 advertise hardware support for pixel and vertex shader
programs of “unlimited” length, plus support for branching within pixel shader programs,
with the caveat that the operating system and API may impose limits on program length,
even though the hardware does not [Nvi04]. Further, Microsoft’s High-Level Shading
Language (HLSL) now supports branching and looping in pixel and vertex shader
programs [Msd04]. Despite these advances, GPU hardware still does not provide
persistent registers for vertex programs or a means to store the results of vertex
operations without rendering to pixels. Therefore, most recent GPU-based
implementations use pixel shaders which operate on data stored as textures, and maintain
state between rendering passes by saving results to off-screen texture memory buffers
(a.k.a. render target textures) [MoA03][Mor03][KrW03][BFH04].

Older pixel shader versions were subject to clamping, whereby color intensities were
restricted to values between zero and one. Much effort has been devoted to finding ways
to convert between real values, represented as floats, and numbers that fit within the
required [0,1] range, such as in GPU-based finite element analysis [RuS01] and refractive
caustics [TrS01]. This is less complicated now, since subsequent versions of HLSL, with
Pixel Shader version 2.0 or later, support the full floating point range [Mor03]. The
abundance of applications based on pixel shaders seems to indicate the pixel shader
instruction set has caught up with the vertex shader in terms of flexibility, leaving little
incentive to use vertices for computation. However, vertices are still used in most of
these applications for setting up the shape and size of the area to be rendered.
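
To make the clamping problem concrete, the sketch below shows one simple way to fit real values into the [0,1] range: an affine map from known bounds. This is an assumption chosen for illustration and is not claimed to be the method of [RuS01] or [TrS01]; the names `encode` and `decode` and the bounds `lo` and `hi` are hypothetical.

```python
def encode(value: float, lo: float, hi: float) -> float:
    """Map a real value from [lo, hi] onto [0, 1], clamping anything outside."""
    scaled = (value - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))


def decode(intensity: float, lo: float, hi: float) -> float:
    """Invert encode() for values that were inside [lo, hi]."""
    return lo + intensity * (hi - lo)


lo, hi = -10.0, 10.0
inside = decode(encode(2.5, lo, hi), lo, hi)    # survives the round trip
outside = decode(encode(25.0, lo, hi), lo, hi)  # clamped to the upper bound
```

The second case shows why such schemes are awkward: any intermediate result that leaves the chosen bounds is silently clamped, so the bounds must be known or estimated before every rendering pass.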

Frameworks, Models and Compilers for General Purpose GPU Computing

In the examples described thus far, getting a GPU to perform general purpose
computing required extensive knowledge of graphics hardware and graphics
programming, down to the assembly language level in many cases, on the part of the
programmer. Such programming is tedious and error-prone, and best managed by a
compiler [THO02]. In fact, several languages now exist that allow shader programs to be
written in a high-level, C-like programming language, including Microsoft’s
High-Level Shading Language (HLSL), nVidia’s Cg, and the OpenGL Shading Language
[BFH04]. While a step in the right direction, these languages are still graphics-oriented,
and require a programmer to express algorithms and data structures in terms of graphics
primitives, such as textures and triangles [BFH04]. Therefore they fall short of providing
an environment for generalized stream computing on the GPU.

There has, however, been some research devoted to providing such an environment. One
example, alluded to previously, presents a framework with abstractions for expressing
vectors, and functions for operating on vectors, on a GPU. This framework defines a
DFunction class which allows unary, binary and ternary functions, with vector operands
and scalar or vector outputs, to be defined. The DVector class works behind the scenes to
allocate an OpenGL p-buffer in GPU memory to accumulate results, thus shielding the
programmer from the intricacies of graphics programming [THO02].

Similarly,  [KrW03]  devised  a  stream  model  for  operating  on  vectors  and  matrices  and 
it  defined  clVec  and  clMat  container  classes  for  expressing  vectors  and  matrices 
respectively.  Upon  initialization,  vectors,  originally  stored  as  C++  arrays,  are  converted 
to  textures  in  the  GPU  and  bound  to  texture  handles.  The  class  instance  keeps  track  of 
the  texture  handles  and  sizes  associated  with  its  respective  matrix  or  vector,  and  makes 
that  information  available  through  public  functions.  Arithmetic  is  performed  via  the 
clVecOp  function,  with  an  enumerator  op  to  select  addition,  multiplication,  or  subtraction 
operations.  The  setting  of  op  selects  a  corresponding  pixel  shader  program  to  perform 
the  operation  on  the  two  input  textures. 
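
The dispatch pattern described above, in which an enumerator selects the shader program to run, can be sketched as follows. This is only an analogy: `VecOp`, `SHADERS` and `cl_vec_op` are hypothetical names and do not reproduce the actual interface of [KrW03]; Python lists stand in for textures and ordinary functions stand in for pixel shader programs.

```python
from enum import Enum
from operator import add, mul, sub
from typing import Callable, Dict, List, Sequence


class VecOp(Enum):
    """Hypothetical enumerator selecting the arithmetic operation."""
    ADD = "add"
    SUB = "sub"
    MUL = "mul"


# Each entry stands in for a pixel shader program bound to one operation.
SHADERS: Dict[VecOp, Callable[[float, float], float]] = {
    VecOp.ADD: add,
    VecOp.SUB: sub,
    VecOp.MUL: mul,
}


def cl_vec_op(op: VecOp, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Apply the shader selected by `op` to corresponding elements of x and y."""
    if len(x) != len(y):
        raise ValueError("operands must have the same length")
    shader = SHADERS[op]
    return [shader(a, b) for a, b in zip(x, y)]


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
total = cl_vec_op(VecOp.ADD, x, y)
scaled = cl_vec_op(VecOp.MUL, x, y)
```

Keeping the operation table separate from the calling function means a new operation only requires a new table entry, which is presumably part of the appeal of selecting shaders by enumerator.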

An important operation in graphics processing is reduction [KrW03]. Reduction is an
operation that condenses or evaluates all data in a stream to produce a smaller subset or a
single value. Examples include summing all elements of a matrix to produce a single
scalar, or finding the element with the minimum or maximum value. GPU hardware does
not yet provide efficient means for accomplishing reduction operations [KrW03]
[BFH04]. Reduction, therefore, requires multiple rendering passes. To
sum all of the elements in a matrix, for example, a pixel shader program could render a
quadrilateral with dimensions half those of the original matrix, placing into each element
of the new matrix the sum of four adjacent pixels from the original. The pixel shader
executes repeatedly, operating on previous results and producing a quarter-sized texture
each iteration. The final result is a single pixel containing the desired sum. Figure 2-2
illustrates this concept. This reduction algorithm operates in O(log(n)) time, where n is
the dimension of the original matrix [KrW03]. Of course, the number of reduction passes
required can be reduced if the number of neighboring pixels summed on each pass is
increased [BFH04].


Figure  2-2.  Reduction  operation  achieved  with  GPU  in  successive  rendering  passes,  summing  groups  of 
four  adjacent  pixels  in  a  texture  and  rendering  to  a  quarter-sized  render  target  texture  in  each  pass. 
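
The multi-pass reduction shown in Figure 2-2 can be mimicked on the CPU with the short sketch below. It is an illustration only: nested Python lists stand in for textures, the function names are hypothetical, and the texture is assumed to be square with a power-of-two dimension.

```python
from typing import List, Tuple

Texture = List[List[float]]


def reduce_pass(texture: Texture) -> Texture:
    """One rendering pass: each output pixel is the sum of a 2x2 input block."""
    half = len(texture) // 2
    return [
        [
            texture[2 * r][2 * c] + texture[2 * r][2 * c + 1]
            + texture[2 * r + 1][2 * c] + texture[2 * r + 1][2 * c + 1]
            for c in range(half)
        ]
        for r in range(half)
    ]


def reduce_sum(texture: Texture) -> Tuple[float, int]:
    """Repeat passes until one pixel remains; return (sum, number of passes).

    Assumption: the texture is square with a power-of-two dimension.
    """
    passes = 0
    while len(texture) > 1:
        texture = reduce_pass(texture)
        passes += 1
    return texture[0][0], passes


n = 8
matrix = [[float(r * n + c) for c in range(n)] for r in range(n)]
total, passes = reduce_sum(matrix)
```

For this order-8 example three passes are needed (8 to 4 to 2 to 1), matching the logarithmic pass count; summing 4x4 blocks instead of 2x2 blocks would lower the pass count at the cost of more texture reads per pass.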

Researchers at Stanford University went a step further than the examples above by
creating a language and compiler for stream computing on graphics hardware, called
Brook. Brook manages memory via streams, data objects containing collections of
records. Parallel functions, called kernels, invoke parallel operations on streams in the
GPU. Reduction functions similar to those described above are also provided. The
Brook system consists of two parts: brcc, a source-to-source compiler, and the Brook
Runtime (BRT), a library of runtime support routines for kernel execution. The compiler
maps Brook kernels into Cg shaders, which are subsequently compiled into GPU
assembly by commonly available vendor-provided compilers. The brcc compiler also
produces C++ code which uses BRT to invoke the kernels. Originally developed as a
language for streaming processors such as Stanford’s Merrimac streaming supercomputer
and the Imagine processor, Brook has been adapted for use on the GPU, supports both
OpenGL and DirectX, and is freely available [BFH04].

GPU  Performance 

The GPU does not generally operate in the same address space as the host CPU;
therefore, an analysis of GPU performance must consider not only computation time, but
also the time spent transferring data into and out of the GPU. This concept is captured in
the metric computational intensity, the ratio of the total cost of executing an algorithm on
a device versus the cost of transferring the data into and out of the device [BFH04]. For
an application to effectively use the GPU, it must possess the following two key
properties [BFH04]:

First,  in  order  to  outperform  the  CPU,  the  amount  of  work  performed  must  overcome  the  transfer  costs 
which  is  a  function  of  the  computational  intensity  of  the  algorithm  and  the  speedup  of  the  hardware. 
Second,  the  amount  of  work  done  per  kernel  call  should  be  large  enough  to  hide  the  setup  cost  required  to 
issue  the  kernel. 

In [LaM01], two order-1024 matrices were multiplied in 0.54 seconds, including
data  transfer  time,  but  there  was  a  one-time  0.2  second  set-up  cost.  Such  an  application 
would  obviously  not  be  a  suitable  candidate  for  GPU  acceleration  unless  many  more 
matrices  are  to  be  multiplied. 
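
The amortization argument above can be made concrete with a small calculation. The sketch assumes a simple cost model in which the GPU pays a one-time setup cost plus a fixed time per multiplication, while the CPU pays only a fixed time per multiplication. The 200 ms setup and 540 ms GPU figures come from the text; the CPU times of 600 ms and 500 ms are hypothetical values chosen for illustration, and the function name is likewise an assumption.

```python
from typing import Optional


def break_even_calls(setup_ms: int, gpu_ms: int, cpu_ms: int) -> Optional[int]:
    """Smallest n with setup_ms + n * gpu_ms < n * cpu_ms, or None if none exists."""
    gain_per_call = cpu_ms - gpu_ms
    if gain_per_call <= 0:
        return None  # the GPU never recovers its setup cost
    return setup_ms // gain_per_call + 1


# Hypothetical CPU at 600 ms per multiplication: setup is recovered eventually.
slower_cpu = break_even_calls(setup_ms=200, gpu_ms=540, cpu_ms=600)

# Hypothetical CPU at 500 ms per multiplication: the GPU never catches up.
faster_cpu = break_even_calls(setup_ms=200, gpu_ms=540, cpu_ms=500)
```

Under these illustrative figures the GPU pays off from the fourth multiplication onward in the first case, while in the second case no number of multiplications recovers the setup cost.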

Setting up kernels or shader programs on a GPU requires a fixed amount of CPU time.
If multiple kernel calls are executed back-to-back, the setup time can overlap with the
kernel execution. If the streams are large, the GPU will be the limiting factor, but if
streams are small, it may not be possible to issue kernel calls fast enough to keep the
GPU busy [BFH04].

The  performance  of  Brook-compiled  GPU  applications  has  been  compared  against 
optimized  and  well-known  CPU  benchmarks,  as  well  as  hand-coded  or  GPU  vendor- 
provided  versions  of  the  applications  optimized  for  a  particular  GPU.  In  addition,  both 
DirectX  and  OpenGL  configurations  have  been  tested  on  the  most  capable  GPUs 
available  in  early  2004,  the  ATI  X800XT  and  nVidia  GeForce  6800,  providing  a  fairly 
complete  and  current  evaluation  of  general-purpose  GPU  computing  capability.  The  test 
applications are linear algebra, FFT, and ray tracing [BFH04].

For linear algebra, two low-level subroutines from the ATLAS Basic Linear Algebra
Subprograms (BLAS) library are emulated, SAXPY and SGEMV. SAXPY performs a
vector scale and sum operation, y = αx + y, and SGEMV performs a matrix-vector
product followed by a scaled vector add, y = αAx + βy, where x and y are vectors, A is a
matrix and α and β are scalars. Vectors have length 1024 and matrices are 1024², in
single-precision floating point. For the CPU benchmarks, the commercial Intel Math Kernel
Library is used for SAXPY, BLAS for SGEMV, and FFTW-3 for the FFT [BFH04].
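
For reference, the two operations can be written out directly from their definitions. The sketch below is a plain Python restatement of y = αx + y and y = αAx + βy, not the ATLAS, Intel or Brook implementation; unlike the BLAS routines, which overwrite y, these functions return a new list.

```python
from typing import List, Sequence


def saxpy(alpha: float, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Return alpha * x + y, element by element."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def sgemv(alpha: float, a: Sequence[Sequence[float]], x: Sequence[float],
          beta: float, y: Sequence[float]) -> List[float]:
    """Return alpha * A x + beta * y; each row of A needs a dot-product reduction."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]


x = [1.0, 2.0]
y = [10.0, 20.0]
a = [[1.0, 2.0],
     [3.0, 4.0]]
saxpy_result = saxpy(2.0, x, y)
sgemv_result = sgemv(2.0, a, x, 0.5, y)
```

SAXPY touches each element independently, whereas every output element of SGEMV requires summing a full row of products, which is the reduction step that matters for GPU performance.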

In  most  of  the  trials  the  hand-coded,  optimized  GPU  reference  applications  ran 
slightly  faster  than  the  Brook-compiled  versions.  Generally,  the  ATI  card  outperformed 
the  nVidia  card  by  a  wide  margin,  almost  by  a  factor  of  four  in  the  worst  case.  This  is 
possibly  due  to  higher  floating  point  texture  bandwidth  on  the  ATI  card,  about  4.5 
Gfloats/second,  versus  nVidia’s  1.2  Gfloats/sec.  Peak  compute  performance  of  ATI  and 
nVidia  was  40  billion  and  33  billion  multiplies  per  second  respectively.  Generally, 
DirectX outperformed OpenGL since DirectX can render directly to a texture, whereas
with OpenGL, an additional copy operation is required to transfer the contents of the
p-buffer into a texture [BFH04].

The  ATI  card  running  the  reference  GPU  application  under  DirectX  executed  SAXPY 
about  eight  times  faster  than  the  CPU  version,  achieving  about  4.9  GFLOPS.  Brook 
running  under  the  same  circumstances  achieved  a  7x  improvement  over  the  CPU  version. 
In contrast, the nVidia card’s best performance for this application was only 1.5
GFLOPS, about a 2.4x improvement over the CPU version. For the reason noted
earlier, OpenGL versions generally achieved only half the performance of the DirectX
versions. For SGEMV, the ATI card under DirectX provided about a 1.7x increase in
performance over the CPU, and the nVidia card actually ran slower than the CPU. For
the FFT application, the ATI card performance matched that of the CPU, and the nVidia
card achieved about 0.7 times the performance of the CPU [BFH04].

In  the  case  of  the  ATI  card  under  DirectX,  the  GPU  either  exceeded  or  matched  the 
CPU-based  applications.  Even  more  encouraging  is  that  the  CPU  benchmarks  were 
optimized  to  make  very  efficient  use  of  the  CPU  cache  structure  [Mor03][BFH04],  which 
means  that  the  GPU  would  most  likely  provide  even  greater  performance  gains  versus 
non-optimized  C++  applications.  For  instance,  without  its  cache  optimization,  the 
effective  performance  of  the  CPU-based  FFT  application  FFTW  would  be  cut  by  over  80 
percent, making the GPU version a full six times faster by comparison [BFH04]. It
would certainly be beneficial if cache optimizations could be applied in the programming
of GPUs. Unfortunately, the order in which pixels are processed within the GPU is an
undocumented implementation detail, which makes it difficult to exploit data locality in
the same manner as is routinely done in CPU programming [Mor03].

For the SGEMV application, the GPU beat the CPU, but not by as wide a margin
as in other applications. This is most likely because SGEMV involves a matrix-vector
multiplication, requiring a multi-pass reduction step. If GPU hardware were equipped with
the persistent registers necessary to facilitate single-pass reductions, performance could
be significantly enhanced. For instance, computing the sum of 2²⁰ 32-bit floats took
approximately 0.79 milliseconds on an ATI/DirectX platform, compared to 14.6
milliseconds on an optimized CPU implementation. While this is good, it is estimated
that such an operation would only take 0.18 milliseconds were the GPU hardware to
provide support for such reductions [BFH04].

Conclusion 

Some have envisioned that supercomputing may one day be conducted on clusters of
inexpensive PCs equipped with multiple high-performance graphics cards versus multiple
CPUs [THO02][Mor03]. The power of the modern GPU is indeed impressive, and it is
becoming increasingly easy to harness that power for general-purpose computing. With
respect  to  the  JMASS  requirement,  some  of  the  examples  in  the  literature  are  directly 
applicable.  For  example,  time-domain  convolution  has  been  accomplished  more 
efficiently  by  the  common  technique  of  first  performing  an  FFT  on  two  images, 
multiplying  them  element-by-element  in  the  frequency  domain,  then  performing  an 
inverse FFT on the result [MoA03]. JMASS similarly requires an element-by-element
multiplication of two matrices, an operation that can be trivially accomplished with a
pixel shader program [MoA03]. After multiplying the rotated reticle image with the IR
scene,  JMASS  requires  that  all  elements  be  summed  to  produce  a  single  luminance  value. 
Such reduction operations were considered in [KrW03] and [BFH04], and it has been
shown that they can be accomplished faster on a GPU. Of the operations required by
JMASS,  only  the  rotation  operation  seems  to  have  no  direct  parallel  in  the  literature.  The 
GPU  does  provide  built-in  means  for  mapping  textures,  via  indexed  lookups,  to 
transformed  (including  rotated)  polygons  [THO02],  making  it  likely  that  the  GPU  can  be 
used  for  accelerating  the  JMASS  rotation  operation.  However,  to  do  so  the  GPU  must 
implicitly  perform  an  interpolation  on  the  original  data.  How  best  to  implement  these 
operations  on  a  GPU,  and  whether  the  GPU  can  deliver  acceptable  levels  of  accuracy  will 
certainly be subjects of this research.

III. GPU Implementation and JMASS Integration

Integration With JMASS

For the GPU to be of any help to JMASS, JMASS must change the way it processes
reticle images. Recall from Chapter I that baseline JMASS generates properly oriented
reticle images to multiply with the IR scene by rotating and interpolating a static reticle
image template. These image rotation and interpolation operations are performed forty
times per model time step, for a total of 100,000 times each during the simulation of a
10-second engagement. A more efficient approach, proposed herein, is to store a set of
pre-rotated and interpolated reticle images of the required size in memory (either in the
GPU or in CPU main memory), and to look them up when needed rather than generating
them repeatedly through costly transformation operations throughout the execution of the
simulation. Integrating GPU processing into JMASS essentially requires this sort of
approach to capitalize on the GPU’s fast texture memory and to limit costly data transfers
between CPU and GPU. Even if the GPU is not used, such a lookup-based approach is
much more efficient because it effectively eliminates hundreds of thousands of O(N²)
image rotation and interpolation operations.

AFIWC  accepted  this  proposal  and  produced  a  modified  version  of  JMASS  which 
implements  a  lookup-based  approach  for  reticle  images.  A  set  of  100  incrementally 
rotated  images,  spanning  a  complete  rotation  of  a  reticle,  is  sufficient  to  replace  the 
continuously  variable  rotations  of  the  baseline  approach.  Prior  to  simulation  start, 
modified  JMASS  generates  this  set  of  100  pre-processed  reticle  images,  then,  depending 
on  whether  or  not  the  GPU  is  being  used,  either  uploads  them  to  the  GPU,  or  stores  them 
in  CPU  main  memory  for  later  use  in  the  simulation. 
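
The lookup that replaces a continuously variable rotation can be sketched as below. The text does not state how modified JMASS maps a rotation angle to one of the 100 stored images, so this sketch makes two explicit assumptions: the images are uniformly spaced over a full rotation (3.6 degrees apart), and the nearest stored image is selected. The names `NUM_RETICLES`, `STEP_DEGREES` and `reticle_index` are hypothetical.

```python
NUM_RETICLES = 100                    # size of the pre-rotated set (from the text)
STEP_DEGREES = 360.0 / NUM_RETICLES   # assumed uniform spacing: 3.6 degrees


def reticle_index(angle_degrees: float) -> int:
    """Index of the stored image whose rotation is nearest to the requested angle."""
    return int(round(angle_degrees / STEP_DEGREES)) % NUM_RETICLES


quarter_turn = reticle_index(90.0)     # 90 / 3.6 = 25
almost_full_turn = reticle_index(359.0)  # rounds up to 100, which wraps to 0
```

Under these assumptions the worst-case angular error of a lookup is half the spacing, 1.8 degrees; whether that is acceptable is an accuracy question for the simulation, not something this sketch settles.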

The only image processing operation remaining to be performed by the GPU,
therefore, is the reticle-scene multiply-add operation. This consists of performing an
element-by-element multiplication of two arrays (the scene and reticle images), and a
summation of the result to produce a single output value. This operation is performed on
forty reticle-scene image pairs during each simulation time step after each IR scene
update. Upon each scene update, the new scene image is uploaded to the GPU. For spin
scan simulations, the scene is multiplied by forty consecutive reticle images (out of the
100), each with a slightly greater rotation than the one before, and the forty results are
returned to JMASS. For conical scan, the reticle images are called for in a random-access
fashion, such that the forty that are used may not be consecutive, or may even repeat. The
conical scan approach additionally requires that the reticle and scene images be shifted
with respect to each other by specified amounts prior to the multiply-add operation, and
the shift can be different for each of the forty reticle images used in the time step.
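
The multiply-add operation and the spin scan access pattern can be expressed as a short CPU reference. This is a sketch for illustration, not the JMASS or GPU code: the function names are hypothetical, nested lists stand in for images, and reticle indices are assumed to wrap around past the last stored image (modulo the size of the set). The conical scan shift is not sketched because the text does not specify how pixels shifted outside the image are treated.

```python
from typing import List, Sequence

Image = Sequence[Sequence[float]]


def multiply_add(scene: Image, reticle: Image) -> float:
    """Element-by-element product of two equally sized images, summed to one value."""
    return sum(s * r
               for scene_row, reticle_row in zip(scene, reticle)
               for s, r in zip(scene_row, reticle_row))


def spin_scan(scene: Image, reticles: Sequence[Image], start: int,
              count: int = 40) -> List[float]:
    """Multiply the scene by `count` consecutive reticles, wrapping past the last one."""
    return [multiply_add(scene, reticles[(start + k) % len(reticles)])
            for k in range(count)]


# Tiny stand-in data: 2x2 images, with reticle i filled with the value i.
scene = [[1.0, 2.0], [3.0, 4.0]]
reticles = [[[float(i)] * 2 for _ in range(2)] for i in range(100)]
results = spin_scan(scene, reticles, start=98)
```

A reference of this kind is mainly useful for checking GPU output: the forty values returned by the GPU for a given scene and starting index should agree with `spin_scan` to within floating point tolerance.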

GPU  Implementation 

Before  attempting  to  implement  the  JMASS  multiply-add  operation  on  a  GPU,  several 
design  choices  had  to  be  made,  starting  with  the  graphics  cards.  First  and  foremost,  the 
graphics  cards  need  to  support  the  IEEE-754  floating  point  format.  At  the  time  of  this 
writing  only  two  graphics  cards  meet  this  requirement,  the  ATI  X800XT  and  the  nVidia 
6800  Ultra.  Though  it  is  possible  that  other  exotic  and  far  more  expensive  graphics  cards 
exist  with  similar  or  better  features,  these  cards  were  chosen  because  they  represent  the 
top  of  the  line  available  to  consumers,  and  because  their  GPU  clock  speed  and  feature 
sets are directly comparable. A second important requirement is that the graphics cards have
sufficient on-board memory to support the storage of the 100 reticle images, plus the
input scene and several textures for storing intermediate results between rendering passes.
The 256MB capacity of these cards was adequate in most cases. A final necessity with
respect to the graphics cards is that they must support Pixel and Vertex Shader version 2.0 or
better because the reduction operations require dependent texture addressing³, which is
not fully supported in previous versions. For the graphics API, DirectX was chosen over
OpenGL because it provides quicker mechanisms for retrieving data from the GPU
[BFH04]. Shader programs were written in Microsoft’s High-Level Shader Language
(HLSL)  versus  assembly  language  for  the  sake  of  simplicity.  The  code  for  controlling 
the  GPU  and  interfacing  it  with  JMASS  was  written  in  C++  to  facilitate  easier  integration 
with  JMASS,  which  is  also  written  in  C++.  This  code  is  included  in  Appendix  B. 
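
The memory requirement mentioned above can be checked with simple arithmetic. The sketch assumes 512 x 512 images stored as one 32-bit float per pixel; both the image size and the single-channel layout are assumptions made for this estimate, and the function names are hypothetical. Intermediate result textures and the frame buffer are ignored.

```python
BYTES_PER_PIXEL = 4   # one 32-bit float per pixel (assumption)
MIB = 2 ** 20         # bytes per megabyte as used here
CARD_MIB = 256        # on-board memory of the selected cards (from the text)


def texture_bytes(width: int, height: int) -> int:
    """Storage needed for one single-channel floating point texture."""
    return width * height * BYTES_PER_PIXEL


def reticle_set_mib(side: int, count: int = 100) -> float:
    """Storage, in megabytes, for `count` square reticle images."""
    return count * texture_bytes(side, side) / MIB


reticles_mib = reticle_set_mib(512)
scene_mib = texture_bytes(512, 512) / MIB
remaining_mib = CARD_MIB - reticles_mib - scene_mib
```

Under these assumptions the reticle set takes 100 MB and the scene 1 MB, leaving roughly 155 MB for intermediate textures and other resources. Doubling the image side to 1024 would quadruple the reticle storage to 400 MB, which would exceed the card's capacity.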

Theory  of  Operation 

The  GPU  interface  is  instantiated  as  an  object,  with  methods  for  uploading  reticle 
images and for processing scene images. For spin scan, JMASS calls the GPU.process
method,  sending  as  parameters  references  to  both  the  scene  array  and  an  array  for  storing 
the  forty  returned  results,  plus  the  starting  reticle  image  index  for  the  consecutive 
sequence  of  reticles  to  multiply  with  the  scene.  For  the  conical  scan  implementation, 
JMASS  identifies  the  indices  of  the  forty  reticle  images  to  use,  and  provides  forty  sets  of 
x-y  offsets  for  shifting  them  with  respect  to  the  scene. 

Due  to  the  high  degree  of  programmability  and  rich  feature  set  offered  by  the  graphics 
cards  and  DirectX  API,  there  are  many  ways  to  implement  the  multiply-add  operation  on 
a  GPU.  For  this  research,  two  approaches  were  explored  for  organizing  the  computations 
within  the  GPU. 


3 Dependent texture addressing allows texture coordinates which address one texture pixel to be used to
derive the coordinates for another.

Sequential Approach

The  first  is  called  the  “Sequential”  approach.  Figure  3-1  shows  a  step-by-step 
progression  of  this  algorithm.  “Step  0”  consists  of  uploading  the  100  pre-processed 
reticle  images  into  GPU  memory  and  storing  each  reticle  image  as  a  separate  texture. 
When 512² images are used, this requires about 100MB (1MB = 2²⁰ bytes) of GPU RAM
for storing the reticle images. This step occurs before the start of the actual JMASS
simulation. Once the simulation is started, JMASS calls the GPU.process method during
each  simulation  time  step  after  a  new  IR  scene  is  generated.  As  shown  in  Figure  3-1, 
GPU  processing  takes  place  in  three  steps.  During  the  first  step,  the  new  IR  scene  is 
uploaded  into  GPU  memory.  In  the  second,  multiply  and  add  step,  the  scene  is  multiplied 
(element-by-element)  with  forty  consecutive  reticle  textures,  producing  a  sequence  of 
forty  new  result  images.  Further,  blocks  of  four  adjacent  pixels  are  summed,  producing 
result  images  that  are  a  quarter  the  size  of  the  original  scene  and  reticle  images.  The  forty 
result  images  are  rendered  one  at  a  time  into  a  single,  large  texture  in  GPU  memory, 
arranged  so  as  to  fill  five  rows  of  eight  images.  At  this  point,  the  reticle  and  scene 
images  have  been  multiplied,  but  their  elements  have  only  been  partially  summed.  These 
intermediate  results  are  stored  in  a  single  large  texture.  Step  three,  called  the  reduce  step, 
completes  the  summation  operation  by  successively  rendering  from  one  intermediate 
result  texture  into  another  sixteenth-size  texture,  summing  blocks  of  16  adjacent  pixels. 
After  two  or  three  such  reduction  operations,  depending  on  the  initial  size  of  the  reticle 
and  scene  images,  the  final  result  texture  contains  forty  pixels,  with  each  pixel  containing 
the result of a corresponding reticle-scene multiply-add operation. The forty numbers are
retrieved from the GPU and returned to JMASS, and the GPU waits for the next scene to
be uploaded (i.e., returns to “Step 1” in Figure 3-1). This “Sequential” approach requires
up to 43 rendering passes: 40 reticle-scene multiply and add operations, followed by up
to three reduction (summation) operations.

Step 0. Prior to simulation start, generate and pre-load 100 reticle images into GPU memory.

Step 1. Upload IR scene image to GPU memory.

Step 2. Multiply and add. Using GPU programmable pipeline, multiply scene by a reticle
image element-by-element, then sum blocks of four adjacent pixels, producing a
quarter-sized result image (4:1 reduction). Store result in render target texture. Do for 40
consecutive reticles. Render target contains the 40 resulting images arranged in a grid.
Note: though not indicated above, reticle index values are mod 100.

Step 3. Reduce. Sum blocks of 16 adjacent pixels, producing a
1/16th-sized result image (16:1 reduction). Repeat up to three
times (depending on original scene/reticle dimensions). End
result is 40 pixels, each a floating point number that is the
desired matrix “dot product” of the scene with a reticle. These are
returned to JMASS. Return to Step 1 to process next scene.

Figure 3-1. “Sequential” approach for processing JMASS multiply-add operation in the GPU.
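The arithmetic of the “Sequential” approach can be sketched on the CPU as follows. This is only an illustration of the multiply, 4:1 sum and repeated 16:1 reduce sequence, not the shader code used in this research. The function names are hypothetical, images are assumed to be square lists of lists with power-of-two dimensions, and each result image is reduced separately rather than being rendered into one shared grid texture.

```python
def multiply_and_sum4(scene, reticle):
    """Step 2: element-wise product, then sum each 2x2 block (4:1 reduction)."""
    n = len(scene)
    return [[sum(scene[r + dr][c + dc] * reticle[r + dr][c + dc]
                 for dr in (0, 1) for dc in (0, 1))
             for c in range(0, n, 2)]
            for r in range(0, n, 2)]


def reduce16(image):
    """Step 3: sum each 4x4 block (16:1 reduction).

    If fewer than 4x4 pixels remain, the whole image is summed instead
    (a simplification made for this sketch).
    """
    n = len(image)
    b = min(4, n)
    return [[sum(image[r + dr][c + dc] for dr in range(b) for dc in range(b))
             for c in range(0, n, b)]
            for r in range(0, n, b)]


def sequential_multiply_add(scene, reticles, start, count=40):
    """Return `count` multiply-add results for consecutive reticles (index mod len)."""
    results = []
    for k in range(count):
        reticle = reticles[(start + k) % len(reticles)]
        partial = multiply_and_sum4(scene, reticle)
        while len(partial) > 1:
            partial = reduce16(partial)
        results.append(partial[0][0])
    return results
```

Each returned value equals the full sum of the element-wise product of the scene with one reticle, which is the quantity the reduction passes accumulate on the GPU.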

Palette  Approach 

A  second  method,  called  the  “Palette”  approach,  achieves  the  same  results,  but  gets 
there by taking a different path. Figure 3-2 provides a step-by-step pictorial
representation of this algorithm. For clarity, the general approach is first described,
followed  by  specific  details. 

To  begin  with,  this  algorithm  stores  the  reticle  images  in  the  GPU  differently  than  the 
previously  described  approach.  Instead  of  storing  the  100  reticle  images  as  separate 
textures, they are arranged by rows and columns, like tiles, into one large “palette” texture.
After  the  scene  image  is  uploaded  to  the  GPU,  it  is  multiplied  with  the  larger  palette 
texture,  taking  advantage  of  a  GPU  addressing  mode  which  effectively  replicates  the 
scene  image  across  the  palette  texture  so  many  copies  of  the  scene  image  line  up  to  be 
multiplied  with  the  many  reticle  images  contained  in  the  palette  texture.  This  is  shown  in 
Figure  3-2,  in  the  diagram  for  Step  2.  In  this  manner,  the  scene  can  be  multiplied  by 
many  reticle  images  in  a  single  rendering  pass.  After  multiplying,  blocks  of  four  adjacent 
pixels  are  summed  such  that  the  resulting  texture  is  a  quarter  the  size  of  the  original 
reticle  palette  texture.  Thus,  this  first  rendering  pass  produces  a  quarter-sized  texture 
holding  the  results  of  many  reticle-scene  multiplications,  but  the  pixels  have  only  been 
partially  summed.  As  in  the  “Sequential”  approach,  the  summation  operation  is 
completed  by  performing  up  to  three  16:1  reduction  operations,  resulting  in  a  final  result 
texture  containing  forty  pixels,  from  which  the  values  are  retrieved  and  returned  to 
JMASS  (see  Figure  3-2,  Step  3).  This  approach  requires  a  maximum  of  four  rendering 
passes  to  complete. 

Now  that  the  basic  approach  has  been  presented,  some  of  the  important  details  left  out 
of  the  above  discussion  can  be  addressed.  First,  GPUs  impose  limitations  on  the 
maximum allowable size for textures. Shader programs further require that textures be
square and have power-of-two dimensions in order to use dependent texture addressing.
Since this addressing mode is vital to the efficient operation of this algorithm, the
textures are subject to all of the above constraints. The effect of these constraints is that it is
impossible to fit the complete set of 100 reticle images into a single palette texture for all
but the smallest supported image size (128²). To solve this dilemma, four different
palettes  are  loaded  into  GPU  memory,  each  containing  a  64-image  subset  of  the  100 
reticle  images.  The  100  reticle  images  are  distributed  among  the  four  palettes  such  that, 
given  any  starting  reticle  index,  there  is  always  at  least  one  palette  which  contains  the 
next  39  required  reticles  in  a  contiguous  block.  All  that  is  required  is  some  simple  range 
checking  in  the  GPU  interface  to  ensure  the  correct  palette  is  chosen  for  the  multiply-add 
operation  based  on  the  starting  index  provided  by  JMASS.  Storing  the  four  palette 
textures requires 64MB of GPU memory if the 256² image size is being used. For the
512² image size, however, the 256MB GPU memory capacity is not large enough to store
four  reticle  palettes.  To  solve  this  problem,  a  more  complicated  three-palette  method  was 
devised,  which  consists  of  multiplying  the  scene  with  up  to  two  different  palettes, 
essentially  performing  this  algorithm  twice,  and  combining  the  results  at  the  end.  The 
three  palette  images  require  192MB  of  GPU  memory. 
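The range check and the memory figures quoted above can be illustrated with a short sketch. The text does not state how the 100 reticles are distributed among the four palettes, so the layout below (palette p holds reticles 25p through 25p + 63, modulo 100) is purely an assumption chosen because it satisfies the stated property. The function names and the 4 bytes per pixel value are likewise assumptions, the latter being consistent with the 100MB figure given for one hundred 512² images.

```python
NUM_RETICLES = 100
PALETTE_SIZE = 64      # images per palette (from the text)
NEEDED = 40            # consecutive reticles needed per scene (from the text)
PALETTE_STARTS = (0, 25, 50, 75)   # hypothetical layout, not from the source


def choose_palette(start_index):
    """Return (palette number, offset in palette) holding NEEDED contiguous reticles."""
    for p, first in enumerate(PALETTE_STARTS):
        offset = (start_index - first) % NUM_RETICLES
        if offset + NEEDED <= PALETTE_SIZE:
            return p, offset
    raise ValueError("no palette holds the requested run")


def palette_bytes(image_size, images=PALETTE_SIZE, bytes_per_pixel=4):
    """Memory for one palette, assuming one 32-bit float per image pixel."""
    return image_size * image_size * bytes_per_pixel * images


MB = 2 ** 20
four_palettes_256 = 4 * palette_bytes(256) // MB    # 64
four_palettes_512 = 4 * palette_bytes(512) // MB    # 256
three_palettes_512 = 3 * palette_bytes(512) // MB   # 192
```

Under this assumed layout every starting index from 0 to 99 maps to exactly one usable palette, and four 512² palettes would consume the card's entire 256MB, which is why the three-palette variant is needed.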

Though  the  fixes  described  above  meant  the  “Palette”  approach  overcame  texture  size 
constraints,  the  approach  itself  proved  to  be  very  inefficient;  it  always  performed  64  (or 
more)  reticle-scene  multiply-add  operations,  when  only  40  are  actually  needed. 

However,  DirectX  provides  a  way  to  narrow  the  size  of  the  drawing  rectangle  so 
rendering  can  be  restricted  to  a  desired  rectangular  subset  of  a  render  target  texture.  The 
palette  approach  was  therefore  modified  to  automatically  set  the  drawing  rectangle  so  as 
to exclude as much of the unneeded portions of the textures as possible from being
processed. Doing so reduced execution time for this algorithm by almost 20% compared
to  its  original  incarnation.  Some  inefficiency  still  remains,  however,  because  this  method 
can  still  allow  up  to  eight  extraneous  images  to  be  processed.  Figure  3-3  shows  this 
remaining  inefficiency. 


Figure 3-3: Inefficiency that can result in the “Palette” GPU algorithm implementation. In this example, the forty needed
reticle images are indices 12-51 in the Reticle Palette. Though adjusting the drawing rectangle can block out some of
the unneeded reticle images (grayed-out strips at top and bottom), it cannot block out those which are highlighted in
the checkered pattern. This results in the GPU processing eight extra images that are not needed. The GPU therefore
takes longer to process certain ranges of reticle images than others, depending on the range given and how many of
the unnecessary images can be masked by the drawing rectangle.
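The count of extraneous images in the Figure 3-3 example can be reproduced with a small calculation. The sketch assumes the 64-image palette is laid out as eight columns by eight rows in row-major order and that the drawing rectangle can only exclude whole rows of images, which matches the grayed-out strips at top and bottom described in the caption. The function names are hypothetical.

```python
def rows_rendered(offset, needed=40, columns=8):
    """First and last palette rows touched by `needed` images starting at `offset`."""
    first_row = offset // columns
    last_row = (offset + needed - 1) // columns
    return first_row, last_row


def extra_images(offset, needed=40, columns=8):
    """Images inside the drawing rectangle that are processed but not needed."""
    first_row, last_row = rows_rendered(offset, needed, columns)
    return (last_row - first_row + 1) * columns - needed
```

For the example in Figure 3-3, `extra_images(12)` gives 8 (rows 1 through 6 are rendered, 48 images in all). When the starting offset falls on a row boundary, such as 16, no extra images are processed.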


Preliminary tests show that the “Palette” approach works well on the smaller image
sizes (128² and 256²), but the “Sequential” approach may be the better of the two for the
512² image size. Interestingly, despite the fact that the “Sequential” approach
requires up to 43 rendering passes, and the “Palette” approach requires only four, the two
algorithms are comparable in performance. Further, though the “Palette” approach works
at all three image sizes on the PCI-express platform, it does not support the 512² image
size  on  the  AGP  platform.  For  some  reason,  perhaps  due  to  DirectX  or  graphics  card 
AGP  drivers,  the  AGP  machine  will  not  allow  more  than  one  reticle  texture  (which  is  a 
full  64MB  in  this  case)  to  be  loaded  into  GPU  memory.  Instead,  the  remaining  palette 
textures  are  forced  into  AGP  aperture  memory  (off  the  graphics  card),  causing  GPU 
processing to take minutes instead of seconds. The “Sequential” approach
is most likely immune to this limitation because it does not require so many large
textures. 

In  both  the  “Palette”  and  “Sequential”  approaches,  the  reticle  and  scene  images  are 
stored  in  the  GPU  such  that  four  image  pixel  values  are  packed  into  each  texture  pixel, 
using  the  texture  pixel’s  four  (R,  G,  B  and  A)  color  channels  as  a  vector,  thereby  fully 
exploiting  the  four-way  parallelism  of  the  GPU.  However,  through  experimentation  it 
was  found  that  maintaining  such  packing  in  the  reduction  stage  slows  the  GPU  down,  and 
it  is  better  to  transition  to  a  single-channel  texture  format  (having  a  single  32-bit  floating 
point R channel versus the full 4 x 32-bit RGBA) for the reduction passes. Doing so
results in as much as a 1.21x speedup for these implementations.
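The four-to-one packing can be pictured with the following sketch. The text does not say which four image pixels share a texel, so grouping four consecutive values from a flattened image is an assumption, and the function names are hypothetical. The point is only that one four-component dot product per texel covers four pixel multiplications and three of the additions.

```python
def pack_rgba(pixels):
    """Group a flat list of pixel values into 4-tuples, one per texel (R, G, B, A)."""
    if len(pixels) % 4:
        raise ValueError("pixel count must be a multiple of four")
    return [tuple(pixels[i:i + 4]) for i in range(0, len(pixels), 4)]


def dot4(a, b):
    """Four-component dot product, the per-texel operation on packed data."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def packed_multiply_add(scene_texels, reticle_texels):
    """Multiply-add over packed texels; equals the sum over all pixel products."""
    return sum(dot4(s, r) for s, r in zip(scene_texels, reticle_texels))
```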

Conical  Scan 

To  support  conical  scan,  a  separate  GPU  algorithm  was  created,  based  on  the 
“Sequential”  algorithm  described  above.  The  “Palette”  approach  could  not  be  used 
because  it  cannot  process  reticle  images  out  of  order,  and  the  reticle  images,  being  part  of 
a  single,  static  texture  that  is  accessed  in  one  rendering  pass,  cannot  be  shifted  by 
differing  amounts  with  respect  to  the  scene.  Because  conical  scan  requires  shifting 
images  by  arbitrary  amounts  prior  to  multiplying  them,  the  reticle  and  scene  images 
cannot  be  packed  four-to-one  into  texture  pixels  as  they  are  for  the  spin  scan  approaches. 
Instead,  a  one-to-one  correspondence  has  to  be  maintained  between  image  and  texture 
pixels,  so  spatial  integrity  is  retained  after  shifting.  Although  this  reduces  efficiency 
somewhat,  resulting  in  slightly  higher  GPU  execution  times,  the  performance  is  still 
competitive with other approaches (see Chapter V). To implement this algorithm, the
shader programs were modified to add horizontal and vertical offset values to the texture
coordinates for the reticle images, as shown in Figure 3-4.


Default orientation with reticle image centered on larger scene image. Axes added so displacement can be
shown in next picture at right.

Conical scan requires the reticle image to be shifted with respect to scene before the multiply-add operation.

Reticle-scene multiply-add operation performed on subset of scene overlapped by the shifted reticle image.

Figure 3-4. Conical scan variation of the JMASS optics processing. Same as “Sequential” implementation, except reticle image
must first be shifted with respect to the larger scene image, by specified x-y offsets, before performing the multiply-add operation.
The shift is performed in the GPU, using a vertex shader.


A  special  feature  was  also  added  to  the  code  to  permit  scene  sizes  of  arbitrary  dimension, 
versus  the  power-of-two  and  square  shape  constraints  of  the  spin  scan  versions.  Reticle 
images,  however,  remain  bound  by  those  constraints,  but  this  is  not  a  detriment  since 
reticles  are  circular  and  hence  symmetrical  in  shape.  One  final  observation  with  respect 
to  shifting  the  images:  some  shift  amounts  can  result  in  5-10%  longer  execution  times  for 
the  GPU.  This  is  likely  due  to  cache  misses  or  address  translation  within  the  GPU.  For 
the  conical  scan  experiments,  whose  results  are  discussed  in  Chapter  V,  the  shift  amounts 
are  randomized  to  provide  reasonably  accurate  estimates  of  performance  that  can  be 
expected. 
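A CPU sketch of the shifted multiply-add may help make the conical scan operation concrete. It is an illustration only: the function name is hypothetical, the reticle is assumed to start centered on the scene, and the sign convention (positive dx moves the reticle toward higher column indices, positive dy toward higher row indices) is an assumption. The scene may be rectangular, in keeping with the arbitrary scene dimensions described above.

```python
def shifted_multiply_add(scene, reticle, dx, dy):
    """Sum of products of the reticle with the scene window it overlaps.

    The reticle starts centered on the (larger) scene and is shifted by
    dx columns and dy rows before the element-wise multiply and sum.
    """
    rows, cols = len(scene), len(scene[0])
    n = len(reticle)
    top = (rows - n) // 2 + dy
    left = (cols - n) // 2 + dx
    if not (0 <= top <= rows - n and 0 <= left <= cols - n):
        raise ValueError("shift moves the reticle outside the scene")
    return sum(scene[top + r][left + c] * reticle[r][c]
               for r in range(n) for c in range(n))
```

Because each reticle pixel must line up with exactly one scene pixel after the shift, the sketch works on unpacked, one-value-per-pixel data, mirroring the one-to-one correspondence described in the text.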
IV. Methodology


Problem  Definition 

Goals,  hypotheses  and  approach.  The  primary  goal  of  this  research  is  to  determine 
whether,  and  to  what  extent,  a  GPU  can  speed  up  JMASS  simulations.  The  image 
processing  calculations  currently  carried  out  in  the  JMASS  software  have  been  presented 
as  the  main  system  performance  bottleneck.  The  general  approach,  therefore,  is  to 
replace  the  JMASS  image  processing  software  routines  with  GPU  hardware  processing, 
and  compare  the  performance  of  the  GPU-assisted  JMASS  with  that  of  the  baseline 
JMASS  system.  It  is  expected  that  the  GPU  will  provide  some  degree  of  acceleration 
since  its  inherent  parallelism,  enhanced  memory  bandwidth  and  stream  processing 
characteristics  make  it  better  suited  to  these  tasks  than  traditional  pipelined  CPU 
processing. 

The  second  goal  of  this  research  is  to  determine  the  performance  gains  achievable 
using  a  GPU.  To  do  so  requires  testing  GPU  and  alternative  processing  methods  apart 
from  JMASS  in  a  controlled  environment.  The  resulting  experiments  represent  a  control 
group  to  be  used  as  a  basis  for  comparison  and  for  interpreting  the  results  of  the 
experiments  which  involve  JMASS.  To  accomplish  this  goal,  GPU  performance  is 
compared  with  that  of  two  CPU-based  (i.e.,  software-based)  alternatives  for 
accomplishing  the  reticle-scene  multiply-add  operation:  a  basic  C++  software 
implementation,  and  another  which  makes  use  of  a  widely  available  cache-optimized 
linear  algebra  library.  The  first  implements  the  reticle-scene  multiply-add  operation 
using  basic  C++  loop  structures,  much  like  baseline  JMASS  does;  the  second  uses  the 
cache-optimized Intel Math Kernel Library (MKL) sdot routine to perform the
operation. Throughout the rest of this document, the three implementations are referred
to as GPU, C++ and MKL.
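The equivalence between the two CPU-based alternatives can be sketched as follows. Python is used here only for illustration (the implementations in this research are C++ and MKL), and the function names are hypothetical. The first function mirrors a basic nested-loop multiply-add; the second flattens both images into vectors and takes their dot product, which is the form a single-precision dot-product routine such as sdot operates on.

```python
def multiply_add_loops(scene, reticle):
    """Nested-loop multiply-add, in the style of a basic loop implementation."""
    total = 0.0
    for r in range(len(scene)):
        for c in range(len(scene[r])):
            total += scene[r][c] * reticle[r][c]
    return total


def multiply_add_dot(scene, reticle):
    """Same operation expressed as a dot product of the flattened images."""
    x = [v for row in scene for v in row]
    y = [v for row in reticle for v in row]
    return sum(a * b for a, b in zip(x, y))
```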

Using  the  experimental  methodology  defined  herein,  it  is  determined  whether 
currently  available  GPUs  will  reduce  JMASS  execution  time,  and  whether  they  provide 
any  advantage  over  CPU-based  implementations.  It  is  anticipated  the  GPU  will  do  both. 
In  addition,  the  results  will  quantify  JMASS  speedup  due  to  the  GPU  and  the  maximum 
overall  speedup  achievable  by  optimizing  the  image  processing  operations.  Ultimately, 
the  results  will  guide  future  hardware  and  software  designs. 

System  Boundaries 

For  the  primary  goal  of  determining  whether  GPU  hardware  processing  can  improve 
JMASS  performance,  the  system  under  test  includes:  the  JMASS  software;  a  high- 
performance  mainstream  PC  host  with  minimal  I/O  (only  keyboard,  monitor,  mouse,  and 
disk  drive);  MS  Windows  XP  operating  system  and  the  latest  version  of  the  DirectX  API; 
a  top-of-the-line  mainstream  graphics  card;  and  a  custom-designed  C++  module  to 
control  GPU  operations  and  provide  an  interface  for  exchanging  data  between  JMASS 
and  the  GPU.  The  component  under  test  in  this  case  is  the  combination  of  the  graphics 
card  and  the  custom  interface  software. 

For  the  second  goal  of  comparing  GPU  performance  to  that  of  CPU-based  software 
alternatives, three system configurations are used. Each configuration consists of a
stand-alone PC as defined above, a simple application to generate reticle images and scene
images (workloads), plus one of the following processing methods, described earlier, as
the component under test: GPU, C++, or MKL.

An additional initial phase of experiments is conducted prior to those indicated above
to  select  the  best-performing  GPU  for  use  in  subsequent  experiments.  See  the 
Experimental  Design  Section,  Table  4-1  and  Figure  4-1  for  precise  details  on  system 
configurations  and  what  is  tested  in  each  experimental  phase. 

System  Services 

The  JMASS  system  simulates  the  flight  of  an  IR-seeking  missile  from  launch  to 
contact with the target. It simulates both the external environment and the missile’s
responses to that environment, and records the behavior of the simulated missile for
later analysis. The overall system service provided by JMASS, therefore, is to generate
behavioral  data  for  various  simulated  missiles  and  environments. 

This  research  focuses  on  optimizing  the  subset  of  JMASS  that  performs  the  image 
processing  calculations  which  simulate  the  optical  path  of  the  missile’s  IR  seeker.  The 
image  processing  service  receives  an  IR  scene  from  the  JMASS  environment  simulator, 
multiplies  it  element-by-element  with  rotated  versions  of  a  template  reticle  image, 
reduces  each  of  the  resulting  images  to  a  single  pixel  by  summing  all  its  pixels,  and 
returns  the  computed  values  back  to  the  JMASS  simulator.  JMASS  supports  various 
image  resolutions.  The  following  image  sizes,  in  pixels,  are  representative  of  those 
routinely used in JMASS simulations: 128², 256², 512² and 1024² (for conical scan).

Possible  outcomes  are  either  a  correct  or  incorrect  computation  of  results,  or  complete 
failure to produce results. Incorrect results would occur if the GPU algorithm were in
some way flawed. Possible causes include improper texture lookup or interpolation of
texture values. Floating point truncation is another possible source of error. In baseline
JMASS, calculations are carried out in double-precision floating point format. The
graphics cards, however, are limited to single-precision IEEE-754 standard 32-bit
floating point format. Additionally, although the ATI card supports IEEE-754 format, it
uses only 24 bits to represent each float (16 bits mantissa, 7 exponent), making the ATI
implementation more susceptible to truncation error. It can therefore be expected that the
different implementations will produce different, if not incorrect, results. Due to the
graphics cards’ decreased precision, it is also possible that numeric overflow may result.
Complete  failure  to  produce  results  would  be  indicative  of  a  system  or  subsystem  failure. 
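The effect of the shorter mantissa can be approximated in software. The sketch below rounds a value to 32-bit single precision and then clears the low 7 of its 23 mantissa bits, leaving 16. This is only a rough model: it assumes the hardware truncates rather than rounds, it keeps the 8-bit exponent of the 32-bit format instead of modeling a 7-bit exponent, and the function names are hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def truncate_mantissa(x, kept_bits=16):
    """Clear the low (23 - kept_bits) mantissa bits of x as a 32-bit float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = 0xFFFFFFFF ^ ((1 << (23 - kept_bits)) - 1)
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]
```

Under this model, a small increment such as 2⁻²⁰ added to 1.0 survives in 32-bit format but is lost entirely with a 16-bit mantissa, and the relative error of any truncated value stays below 2⁻¹⁶.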

Workload 

For  those  experiments  involving  the  JMASS  system,  the  workload  consists  of  running 
an unclassified AFIWC-provided test scenario at each of three scene/reticle image sizes:
128², 256² and 512² pixels. The specific scenario used is the unclassified Generic
Man-Portable Air Defense System (MANPADS) Threat Model, set for a 10-second
engagement.  This  scenario  is  representative  of  the  types  of  workload  used  in  JMASS 
simulations. 

For  the  remaining  experiments  which  do  not  involve  JMASS,  the  workload  consists  of 
test  images  representing  the  IR  scene,  submitted  repeatedly  to  the  system  for  processing. 
Since  the  GPU  executes  a  deterministic  mathematical  operation  on  known  input  data,  the 
accuracy  of  the  output  is  easily  verified,  and  the  test  data  for  these  experiments  need  not 
originate  from  JMASS.  The  workload  is  varied  by  changing  the  size  and  content  of  the 
test  images,  which  are  the  only  aspects  of  the  workload  that  can  be  changed. 

With  respect  to  the  content  of  the  workload,  three  scene  update  schemes  are  used: 
non-changing,  fully  changing,  and  moving  point  source.  The  non-changing  scheme  sends 
the same image to the processor every time. It is expected that this scheme will provide
the most accurate, “best case” measurement of execution time, since it introduces no
delay between calls to the processing method (i.e., the GPU, C++ or MKL method under
test). The moving point source scheme causes a single, unit-valued pixel to trace out a
square path within the scene over time, emulating a JMASS point source simulation.
This is considered a “middle of the road” scheme in that it only changes two pixels of the
scene upon each update. Since the point source “moves” in a non-sequential way through
array  memory,  it  may  induce  cache-specific  behavior.  The  fully-changing  scheme  is 
intended  to  more  closely  resemble  JMASS  since  it  changes  every  pixel  in  the  scene 
image  between  calls  to  the  optics  processing  routine.  The  fully-changing  scheme  is 
accomplished  by  adding  the  value  of  1.0  to  each  pixel  upon  a  scene  update.  Since 
updating  the  scene  in  this  manner  requires  some  processing  time,  it  is  expected  that 
observed  execution  times  will  at  least  increase  by  some  uniform  amount.  A  change  that 
is  disproportionate  may  indicate  an  unexpected  interaction  of  factors. 
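The three scene update schemes can be sketched as small functions that modify a scene in place and report how many pixels they changed. The function names, the inset of the square path and the step-indexed interface are all assumptions made for illustration; only the behavior of each scheme (no change, add 1.0 to every pixel, move one unit-valued pixel along a square path) comes from the text.

```python
def update_non_changing(scene):
    """Leave the scene untouched; zero pixels change."""
    return 0


def update_fully_changing(scene):
    """Add 1.0 to every pixel; returns the number of pixels changed."""
    for row in scene:
        for c in range(len(row)):
            row[c] += 1.0
    return len(scene) * len(scene[0])


def square_path(n, margin):
    """Pixel coordinates around a square inset by `margin` in an n x n scene."""
    lo, hi = margin, n - 1 - margin
    top = [(lo, c) for c in range(lo, hi)]
    right = [(r, hi) for r in range(lo, hi)]
    bottom = [(hi, c) for c in range(hi, lo, -1)]
    left = [(r, lo) for r in range(hi, lo, -1)]
    return top + right + bottom + left


def update_moving_point(scene, step, path):
    """Move the unit-valued pixel one position along the path; two pixels change."""
    old_r, old_c = path[(step - 1) % len(path)]
    new_r, new_c = path[step % len(path)]
    scene[old_r][old_c] = 0.0
    scene[new_r][new_c] = 1.0
    return 2
```

The moving point source touches memory locations in different rows as it travels along the vertical sides of the square, which is the non-sequential access pattern the text suggests may expose cache-specific behavior.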

The workload uses image sizes of 128², 256² and 512² pixels. Conical scan
experiments, however, use scene image dimensions twice those of the reticle image. For
those experiments, the reticle image sizes (in pixels) are 128², 256² and 512², and the
corresponding scene image sizes are 256², 512² and 1024².

Image  size  is  the  most  important  factor  of  the  workload,  since  it  directly  affects 
execution  time.  Image  content  is  important  from  the  standpoint  of  verifying  calculations 
have  been  performed  correctly,  and  that  values  do  not  exceed  the  range  of  32-bit  floating 
point  numbers.  The  scene  update  scheme,  which  periodically  alters  the  contents  of  the 
workload, might also impact performance.

Performance Metrics


Execution  time  is  the  natural  choice  for  a  performance  metric  since  this  research  is 
motivated  by  a  requirement  to  reduce  JMASS  execution  time.  An  additional  metric  is  to 
measure  the  differences,  if  any,  between  the  results  computed  by  baseline  JMASS  and 
those produced by the GPU and software-based image processing implementations
developed  for  this  research.  Such  deviations  are  expected  to  be  relatively  small,  resulting 
from  floating  point  truncation.  They  are  nevertheless  reported  because  it  is  unknown 
how  such  differences,  however  small,  will  affect  simulation  outcomes. 

Parameters 

Parameters  are  those  aspects  of  the  system  or  the  workload  which  could  affect  system 
performance  if  changed.  The  following  is  a  comprehensive  list  of  parameters,  and  their 
associated  levels  where  applicable.  Note  that  only  a  subset  of  these  are  actually  varied 
during  the  experiments  (see  Factors  below). 

• System parameters:
    o PC platform
        ■ Processor type and speed, cache size
        ■ Memory and I/O configuration
        ■ I/O bus architecture
        ■ Operating system
        ■ DirectX version
    o GPU (graphics card)
    o Software
        ■ JMASS version
        ■ JMASS configuration
        ■ GPU algorithm implementation
        ■ Pixel shader version
        ■ Image processing implementation

• Workload parameters:
    o Image size in pixels
    o Scene update scheme

Factors


Factors  are  those  parameters  which  are  expected  to  have  the  greatest  impact  on  system 
performance  and  so  will  be  varied  singly  or  in  combination  with  other  factors  during  the 
experiments.  The  chosen  factors,  and  their  associated  levels,  are  listed  below. 

• System factors:
    o PC platform
        ■ Bus architecture: AGP or PCI-express
    o GPU (graphics card): ATI Radeon X800XT or NVidia GeForce 6800
    o Software
        ■ JMASS configuration: Baseline, Modified JMASS (Software), Modified JMASS
          (GPU-assisted)
        ■ Image processing implementation: GPU, non-optimized software (C++), or
          cache-optimized linear algebra library (MKL)
        ■ GPU algorithm implementation: Palette versus Sequential

• Workload factors:
    o Image size: 128², 256² and 512² pixels (IR scene and reticle images)
    o Scene update scheme: non-changing, fully-changing, moving point source

The  factor  expected  to  cause  the  greatest  performance  variation  is  the  image  size 
(workload) factor. This is because the required number of multiplication and summation
operations grows with the square of the image dimension. Further, based on
review  of  the  literature,  there  may  be  large  performance  differences  between  cache- 
optimized  and  non-optimized  software  implementations.  The  PCI-express  bus 
architecture  doubles  the  bandwidth  for  data  transfers  between  host  and  GPU  memories, 
and  so  may  also  be  an  important  factor. 
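A quick count shows how strongly image size drives the workload. The sketch assumes one multiply-add per pixel per reticle and forty reticles per scene, as in the spin scan processing described in Chapter III; the function name is hypothetical.

```python
def multiply_adds_per_scene(image_size, reticles_per_scene=40):
    """Multiply-add operations needed to process one scene against its reticles."""
    return reticles_per_scene * image_size * image_size


counts = {n: multiply_adds_per_scene(n) for n in (128, 256, 512)}
# counts == {128: 655360, 256: 2621440, 512: 10485760}
```

Each doubling of the image dimension quadruples the operation count, so the 512² workload involves sixteen times as many multiply-adds as the 128² workload.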

GPU  algorithm  implementation  (Palette  or  Sequential)  represents  two  different  ways  to 
organize  the  rendering  operations  performed  by  the  GPU.  Preliminary  tests  show  the  best 
(i.e.,  fastest)  method  to  use  depends  on  the  GPU  and  image  size  being  used.  Details  of 
the GPU algorithm implementation options are discussed in Chapter III.

Parameters related to the PC platform (with the exception of bus architecture) are not
varied  because  it  is  expected  that  AFIWC  will  simply  run  JMASS  on  the  highest- 
performance  mainstream  PC  available,  equipped  with  minimal  I/O  and  a  large  RAM 
complement.  While  such  parameters  can  certainly  affect  overall  system  performance, 
any  changes  attributable  to  them  would  be  the  same  regardless  of  whether  GPU 
acceleration  was  being  used,  so  they  are  held  constant  during  these  experiments.  Further, 
because  neither  the  JMASS  nor  GPU  processes  require  frequent  disk  access,  the  disk 
subsystem  is  not  seen  as  an  important  parameter  with  respect  to  JMASS  or  GPU 
performance. 

Evaluation  Technique 

The  evaluation  technique  is  primarily  direct  measurement  since  all  resources  are 
readily  available  for  experimentation,  and  execution  time  is  easily  measured.  Further, 
since  graphics  cards  are  proprietary  devices,  they  defy  simulation  using  standard  software 
tools  or  analytical  methods.  While  strictly  speaking  simulation  is  not  used,  the  first  three 
phases  of  experiments,  described  in  detail  in  the  following  section,  can  be  considered  an 
emulation  which  predicts  to  some  degree  how  the  GPU  would  perform  if  it  were 
integrated  into  JMASS.  The  results  of  such  stand-alone  subsystem  testing  can  be 
validated  by  comparing  them  with  the  results  of  the  fourth  phase  of  experiments,  which 
integrate  the  GPU  with  JMASS.  This  is  discussed  further  in  the  Analysis  of  Results 
section. 

Experimental  Design 

Experiments  are  organized  into  four  phases,  each  with  its  own  specific  purpose  and 
experimental design. The first phase is intended to compare the stand-alone (separate
from JMASS) spin scan image processing performance of the two graphics cards under
various  workloads,  using  two  different  GPU  algorithm  implementations,  and  to  select  the 
configuration  which  yields  the  best  performance  for  use  in  subsequent  experiments.  In 
this  first  phase  of  experiments,  JMASS  is  not  used.  Instead,  the  two  graphics  cards,  the 
ATI  X800XT  and  the  nVidia  6800  Ultra,  are  treated  as  stand-alone  subsystems  which 
emulate  the  JMASS  image  processing  function.  Since  the  nVidia  card  was  not  available 
in  a  PCI-express  version,  only  the  AGP  versions  of  the  cards  are  compared.  Each 
replication  of  an  experiment  consists  of  submitting  a  test  image  (IR  scene)  to  the  graphics 
card  for  processing  1,000  times  and  measuring  the  total  execution  time.  Execution  times 
were measured using calls to the Windows timeGetTime() function, which returns
the  value  of  the  system  clock  with  one-millisecond  resolution.  Running  1,000  iterations 
of  the  GPU  algorithm  ensures  execution  time  results  well  above  1  millisecond  for  all 
experiments.  Running  the  GPU  algorithm  1,000  times  is  roughly  equivalent  to  the 
amount  of  optics  processing  performed  by  JMASS  to  simulate  4  seconds  of  a  missile’s 
flight. The factors (levels) varied in this set of experiments are: graphics card (NVidia,
ATI); image size (128², 256², 512²); GPU algorithm implementation (Palette,
Sequential); and scene update scheme (non-changing, fully-changing). However,
because the “Palette” GPU implementation does not run correctly on the AGP platform at
the 512² resolution, the subset of experiments involving the 512² image size is analyzed
separately to prevent skewing the results.
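The timing procedure can be sketched as a small harness. This is not the measurement code used in the experiments: it substitutes Python's time.perf_counter for the Windows timer described above, and the function names and the `process` callable are hypothetical stand-ins for the GPU, C++ or MKL method under test. One replication times 1,000 back-to-back calls, and the default of 30 replications matches the r = 30 used in the experimental designs.

```python
import time


def time_replication(process, scene, iterations=1000):
    """Total time in milliseconds for `iterations` back-to-back calls."""
    start = time.perf_counter()
    for _ in range(iterations):
        process(scene)
    return (time.perf_counter() - start) * 1000.0


def run_experiment(process, scene, replications=30, iterations=1000):
    """Repeat the timed replication and return one total time per replication."""
    return [time_replication(process, scene, iterations)
            for _ in range(replications)]
```

Timing the whole batch of calls, rather than each call separately, keeps the measured interval well above the resolution of the timer.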

Two experimental designs are used in this phase of experiments. The first, involving
the 128² and 256² image sizes, is a 2^k r full-factorial experimental design using the k = 4
factors listed above and r = 30 replications. In this phase, and in Phases Two and Three
as well, each experiment was repeated 30 times, back to back. The second experimental
design separately tests the 512² image size case. Though also full-factorial, there are only
two factors that can be varied: GPU and scene update scheme.

Figure 4-1. System configurations for each phase of experiments.
(a) System configurations for Phase 1 experiments (platform: PC with AGP bus). Spin scan
implementation. Compares performance of ATI and nVidia GPUs under various workloads, scene
update schemes and GPU algorithm implementations.
(b) Three system configurations for the Phase 2 experiments. Spin scan implementation. GPU
image processing performance compared to software-based implementations under different
workloads and scene update schemes, on both AGP and PCI-express platforms.
(c) Three system configurations for the Phase 3 experiments. Conical scan implementation. GPU
image processing performance compared to software-based implementations under different
workloads and scene update schemes, on both AGP and PCI-express platforms.
(d) System configurations for Phase 4 experiments. JMASS baseline performance
compared to Modified JMASS and GPU-assisted JMASS performance.

Table 4-1. Experimental designs for all phases of experiments.
(b) Phase 1b experimental design. Spin scan. ATI vs. nVidia. Image size = 512². Algorithm = Sequential.
(c) Phase 2 experimental design. Spin scan. GPU vs. Software-based Processing Methods.
(d) Phase 3 experimental design. Conical scan. GPU vs. Software-based Processing Methods. Non-changing scene update scheme.
(e) Phase 4 experimental design. Baseline vs. Modified JMASS Software and GPU-assisted versions.

The  full-factorial  design  tests  all  possible  combinations  of  the  factors,  and  identifies 
the  configuration  that  yields  the  best  performance.  Replication  provides  more  samples 
than  single  trials  and  allows  the  estimation  of  experimental  error.  Knowing  experimental 
error  is  advantageous  because  it  isolates  the  error  attributable  to  unknown  sources  from 
the  error  produced  by  the  factors  under  test.  Therefore,  confidence  intervals  can  be 
calculated  for  the  effects  and  provide  a  qualitative  indicator  of  the  validity  of  the 
experimental design. There are 2^k (r - 1) = 464 degrees of freedom in the mean squared
error calculations for the first design, and 116 for the second design. Since there are
more than 30 degrees of freedom, confidence intervals for the effects are determined
using quantiles of the unit normal distribution. There are 20 total experiments in this
phase. Table 4-1 (a and b) provides templates for this experimental design; Figure 4-1(a)
shows the system configurations used in the experiments.

In  the  second  phase  of  experiments,  the  best-performing  GPU  from  the  first  phase  is 
retained  to  be  tested  against  both  C++  and  MKL  software  implementations  of  the  JMASS 
spin  scan  reticle-scene  multiply-add  operation.  The  intent  of  this  phase  is  to  compare 
GPU  hardware-accelerated  image  processing  performance  with  that  achievable  using 
software.  In  these  experiments,  JMASS  is  not  used.  Instead,  the  three  implementations 
are  treated  as  stand-alone  subsystems  which  emulate  the  JMASS  image  processing 
function. A test workload is submitted for processing 1,000 times, and the total
execution time measured. The experimental design is four-factor, full factorial with
replication. The factors (levels) used in these experiments are: processing method (GPU,
C++, MKL); bus/platform (AGP, PCI-express); scene update scheme (non-changing,
fully-changing, moving point source); and image size (128², 256², 512²). Each
experiment is conducted 30 times, providing 1566 degrees of freedom in the mean
squared error calculations. Since there are more than 30 degrees of freedom, confidence
intervals for the effects are determined using quantiles from the unit normal distribution.
There are 54 total experiments in this phase. Table 4-1(c) provides a template for this
experimental design; Figure 4-1(b) shows the system configurations used in the
experiments.
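
The experiment counts and error degrees of freedom quoted for the Phase One and Phase Two designs can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `design_size` is hypothetical, and it assumes a replicated full-factorial design in which every combination of factor levels is run r times.

```python
from math import prod

def design_size(levels_per_factor, replications):
    """Return (experiments, error degrees of freedom) for a
    replicated full-factorial design (hypothetical helper)."""
    experiments = prod(levels_per_factor)          # every combination of levels
    error_dof = experiments * (replications - 1)   # dof left for mean squared error
    return experiments, error_dof

# Phase One, first design: GPU, image size (128², 256²), algorithm, scene update.
phase1a = design_size([2, 2, 2, 2], 30)
# Phase One, second design (512² only): GPU and scene update scheme.
phase1b = design_size([2, 2], 30)
# Phase Two: processing method, bus/platform, scene update scheme, image size.
phase2 = design_size([3, 2, 3, 3], 30)

print(phase1a, phase1b, phase2)
```

The two Phase One designs together account for the 20 experiments stated in the text, and the Phase Two design gives the 54 experiments and 1566 degrees of freedom.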

A  third  phase  of  experiments  compares  the  performance  of  the  GPU  against  CPU- 
based  approaches  for  performing  the  conical  scan  variation  of  the  JMASS  image 
processing  calculations  on  both  AGP  and  PCI-express  platforms  at  three  reticle  image 
sizes: 128², 256², and 512². For these experiments the non-changing scene update
scheme  is  used,  and  the  experiments  are  broken  into  three  subsets  according  to  reticle 
image  size,  and  analyzed  as  three  separate  designs  (for  rationale,  see  Chapter  V).  Each 
design  is  two-factor,  full-factorial,  with  the  following  factors  (levels):  processing  method 
(GPU,  C++,  MKL)  and  bus/platform  (PCI-express,  AGP). 

As in the previous phases, each experiment consists of running the algorithm 1,000
times and measuring the total execution time, and 30 replications were accomplished for
each experiment. Conical scan, however, requires more input parameters: a list of 40
reticle indices and shift offsets. These were chosen at random prior to each experiment
using the C++ rand command. A different seed was used for each experiment, drawn
from a uniform distribution between zero and 4,294,967,295. The seeds were generated
using the Matlab rand command. Conical scan also requires the scene image be larger
than the reticle image. For these experiments, the scene image dimensions were chosen
to be twice those of the reticle image, resulting in the scene having four times the number
of pixels as the reticle. As in the previous phases, confidence intervals for the effects are
based on the unit normal distribution. There are 18 total experiments in this phase. Table
4-1(d) provides a template for this experimental design; Figure 4-1(c) shows the system
configurations used in the experiments.

In  the  fourth  and  final  phase  of  experiments,  the  two  graphics  cards  are  integrated  with 
JMASS,  and  the  performance  of  GPU-assisted  JMASS  is  compared  to  that  of  baseline 
JMASS. The intent of this set of experiments is to determine whether, and to what extent,
GPU  hardware  acceleration  can  speed  up  JMASS  simulations.  Experimental  design  in 
this  case  is  a  two  factor,  full  factorial  experiment  without  replication  (i.e.,  each 
experiment  was  conducted  once). 

The  first  factor  is  the  JMASS  software  version,  consisting  of  the  following  four  levels: 
baseline  JMASS,  Modified  JMASS  (Software),  Modified  JMASS  using  the  nVidia  card 
for  acceleration,  and  Modified  JMASS  using  the  ATI  card.  The  last  two  levels  are  also 
referred  to  as  “GPU-assisted  JMASS”  throughout  this  document.  Modified  JMASS 
(Software)  is  an  improved  version  of  baseline  JMASS,  implementing  a  lookup  based 
approach  for  the  reticle  images  that  eliminates  the  rotation,  resizing  and  interpolation 
operations (refer to the beginning of this chapter and to Chapter V for more details).
Modified  JMASS  (Software)  processes  the  reticle-scene  multiply-add  operations  in 
software,  analogous  to  the  “C++”  implementation  in  the  Phase  Two  experiments.  The 
GPU-assisted version of Modified JMASS is the same as Modified JMASS (Software),
except the reticle-scene multiply-add operation is performed in GPU hardware. The
second factor is image size, with the same three levels used in other experiments: 128²,
256², 512².

Configuring  JMASS,  integrating  the  GPU  code  and  interpreting  the  JMASS  results 
required  the  assistance  of  JMASS  subject  matter  experts.  Only  one  replication  was 
performed  for  each  experiment  because  the  experiments  take  so  long  to  run  (almost  two 
hours for the 512² case), and because running them required travel to an out-of-state
contractor  facility,  so  the  time  for  conducting  the  experiments  was  limited.  Though  this 
single-replication  experimental  design  precludes  conducting  an  analysis  of  variance,  it 
should  be  sufficient  for  the  purposes  of  estimating  likely  speedup  resulting  from  GPU 
acceleration. There are 12 total experiments in this phase. Table 4-1(e) provides a
template for this experimental design; Figure 4-1(d) shows the system configurations
used in the experiments.

Analysis  of  Results 

The  full-factorial  designs  described  above  permit  a  comprehensive  analysis  of  results. 
Because  the  effects  of  processors  and  workloads  interact  in  a  multiplicative  fashion 
[Jai91], a multiplicative model using a log-transform of the execution time results is used
for  the  analyses.  The  analysis  model  assumes  that  errors  in  the  experimental  results  are 
normally  distributed,  and  there  is  no  trend  in  variance  with  respect  to  mean  responses. 
Normal  quantile-quantile  plots  of  the  errors,  and  plots  of  the  errors  with  respect  to  the 
mean  responses  are  therefore  used  to  validate  use  of  the  multiplicative  model  where 
applicable.  The  mean  effects  of  all  factors  and  their  associated  levels  are  computed,  and 
a 90% confidence interval given for each. The same applies for all possible combinations
(i.e., interactions) of factors/levels. The mean effects for each level of a factor are used to
determine  the  average  relative  speedup  (or  slowdown)  that  results  when  one  level  is 
chosen  over  another.  Effects  that  are  statistically  significant  have  confidence  intervals 
that do not include zero. An effect whose confidence interval contains the mean of
another effect indicates statistically indistinguishable performance. In such cases, increasing the
number  of  replications  or  decreasing  the  confidence  level  (narrowing  the  confidence 
interval)  may  permit  the  effects  to  be  distinguished. 
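
As a rough illustration of this analysis, the sketch below estimates effects and a 90% confidence interval half-width for a replicated two-factor, two-level design on log10-transformed execution times, following the usual sign-table method for 2^k r designs. Everything specific here is an assumption made for the example: the data are synthetic, the factor names A and B and the effect sizes are hypothetical, and the function `effects_with_ci` is not part of the thesis software.

```python
import math
import random
from itertools import product

Z_90 = 1.645  # two-sided 90% quantile of the unit normal distribution

def effects_with_ci(log_times):
    """log_times maps (xA, xB), each -1 or +1, to a list of r log10 times.
    Returns the effects q and the 90% confidence interval half-width."""
    cells = list(log_times)
    r = len(log_times[cells[0]])
    n_cells = len(cells)
    means = {c: sum(v) / r for c, v in log_times.items()}
    # Sum of squared errors around each cell mean (experimental error).
    sse = sum((y - means[c]) ** 2 for c, v in log_times.items() for y in v)
    s_e = math.sqrt(sse / (n_cells * (r - 1)))
    half_width = Z_90 * s_e / math.sqrt(n_cells * r)
    signs = {"A": lambda a, b: a, "B": lambda a, b: b, "AB": lambda a, b: a * b}
    q = {name: sum(f(a, b) * means[(a, b)] for a, b in cells) / n_cells
         for name, f in signs.items()}
    return q, half_width

# Synthetic log10 execution times with r = 30 replications per cell.
rng = random.Random(0)
data = {(a, b): [0.5 - 0.34 * a + 0.26 * b + rng.gauss(0.0, 0.01)
                 for _ in range(30)]
        for a, b in product((-1, 1), repeat=2)}

q, hw = effects_with_ci(data)
# An effect is statistically significant if its interval excludes zero.
significant = {name: abs(value) > hw for name, value in q.items()}
print(q, hw, significant)
```

With 4 × 29 = 116 error degrees of freedom, using the unit normal quantile rather than a t quantile matches the rule of thumb stated in the text.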

In  addition  to  determining  the  confidence  intervals  for  all  effects,  each  factor,  and 
interaction  of  factors,  is  examined  to  determine  its  contribution  to  the  total  variation  of 
results  (a.k.a.  “allocation  of  variation”).  Those  factors  which  contribute  the  most  to  the 
total  variation  generally  have  the  greatest  practical  impact  on  performance.  The 
statistical  significance  of  each  factor  can  be  further  verified  by  performing  an  analysis  of 
variance, or “F-test”, using a 90% confidence level. For a factor to be considered
significant,  its  contribution  to  the  variance  of  results  must  exceed  that  of  the  estimated 
experimental  error  for  the  respective  degrees  of  freedom. 

For  the  first  phase  of  experiments,  which  compares  the  performance  of  the  two 
graphics  cards  under  various  workloads  using  the  two  GPU  algorithm  implementations, 
the  analysis  techniques  described  above  are  used  to  determine  which  factors/levels  have 
the  greatest  impact  on  performance.  In  addition,  the  two  graphics  cards  are  contrasted  to 
determine  which  provides  the  best  average  performance. 

For the second and third phases of experiments, which compare GPU performance to
software-based alternatives for accomplishing the JMASS spin scan and conical scan
procedures, a similar analysis is performed to determine significant factors, and to
contrast the performance of the three implementations.

For  the  fourth  phase,  which  tests  GPU-assisted  JMASS  (using  both  ATI  and  nVidia 
cards)  against  baseline  JMASS  and  Modified  JMASS  (Software),  the  performance  of  the 
four  implementations  is  contrasted  to  determine  whether,  and  to  what  extent,  GPU 
acceleration  improves  JMASS  performance.  Additionally,  using  insight  from  the 
previous  phases  of  experiments  and  Amdahl’s  performance  equation,  it  is  possible  to 
determine  the  maximum  JMASS  speedup  achievable  with  GPU  acceleration. 
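
Amdahl's law bounds the overall speedup by the fraction of run time that is actually accelerated. The sketch below shows the calculation in general form; the function names are hypothetical, and the 50% fraction and 5x component speedup used in the example are placeholder values chosen for illustration, not measurements from JMASS.

```python
import math

def amdahl_speedup(fraction_accelerated, component_speedup):
    """Overall speedup when a fraction of run time is sped up by a given factor."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be in [0, 1]")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction_accelerated)
                  + fraction_accelerated / component_speedup)

def amdahl_limit(fraction_accelerated):
    """Upper bound on overall speedup as the component speedup grows without limit."""
    if fraction_accelerated >= 1.0:
        return math.inf
    return 1.0 / (1.0 - fraction_accelerated)

# Hypothetical example: half of the run time is optics processing,
# and the GPU performs that part five times faster.
example_speedup = amdahl_speedup(0.5, 5.0)
example_limit = amdahl_limit(0.5)
print(example_speedup, example_limit)
```

The limit function makes the point of the paragraph above explicit: however fast the GPU becomes, the overall gain cannot exceed the reciprocal of the fraction of time left unaccelerated.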

Summary 

The  experiments  described  herein  are  designed  to  definitively  address  the  research 
goal, which is to determine whether, and to what extent, GPU hardware acceleration can
be  used  to  improve  JMASS  execution  time.  In  addition,  the  results  of  this  research 
provide  valuable  insight  as  to  how  GPU  algorithm  implementations,  scene  update 
schemes  and  bus  technologies  affect  GPU  performance  in  the  accomplishment  of  certain 
general-purpose  computing  tasks.  Finally,  the  comparison  of  GPU-accelerated  image 
processing  performance  with  that  of  non-optimized  code  and  code  using  a  cache- 
optimized  linear  algebra  library  will  likely  provide  AFIWC  new  alternatives  for 
optimizing the existing JMASS software, even if the GPU fails to be a good option.

V. Results and Analysis

Introduction

Experiments  are  conducted  in  four  phases.  In  all  but  the  fourth  phase,  where  tests 
were  actually  run  using  JMASS,  experiments  consist  of  calling  the  optics  processing 
algorithm  1,000  times  and  recording  the  total  execution  time.  Recall  that  a  single 
iteration  performs  forty  reticle-scene  multiply-add  operations  (equivalent  to  performing  a 
dot  product  on  forty  pairs  of  vectors,  with  each  vector  containing  the  same  number  of 
elements  as  there  are  pixels  in  the  scene  or  reticle  image),  and  returns  the  forty  results  in 
an  array  back  to  the  calling  application.  It  follows  that  for  1,000  iterations,  each 
experiment  results  in  40,000  reticle-scene  multiply-add  operations.  For  reference,  this  is 
the  amount  of  optics  processing  that  occurs  in  JMASS  during  a  simulated  4-second 
engagement. In those experiments involving a GPU, the execution time includes both the
GPU processing time and the time spent transferring data into and out of the GPU. Each
experiment was repeated 30 times. Results, analysis of variance and allocation of
variation for each phase are included in Appendix A. Analysis was performed using a
log-transform of the execution time results (cf. Chapter IV). An analysis of each phase
follows.
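
To make the workload concrete, one iteration can be viewed as forty dot products between flattened images. The following is a minimal CPU-side sketch, not the JMASS or GPU implementation: the function names are hypothetical, the images are tiny 2 × 2 placeholders, and it assumes the same scene is paired with each of the forty reticles.

```python
def multiply_add(reticle, scene):
    """Element-by-element multiply of two flattened images, then sum (a dot product)."""
    if len(reticle) != len(scene):
        raise ValueError("reticle and scene must have the same number of pixels")
    return sum(r * s for r, s in zip(reticle, scene))

def process_iteration(reticles, scene):
    """One iteration: one multiply-add result per reticle, returned as a list."""
    return [multiply_add(reticle, scene) for reticle in reticles]

# Placeholder data: a 2 x 2 scene and forty 2 x 2 reticles, flattened row by row.
scene = [1.0, 2.0, 3.0, 4.0]
reticles = [[float(i % 2)] * 4 for i in range(40)]

results = process_iteration(reticles, scene)
operations_per_experiment = 1000 * len(results)  # 1,000 iterations of forty operations
print(len(results), operations_per_experiment)
```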

Phase  One  Experiments:  ATI  Versus  nVidia 

This  phase  compares  the  performance  of  the  nVidia  and  ATI  graphics  cards  executing 
the  JMASS  spin  scan  optics  calculations.  Since  a  PCI-version  of  the  nVidia  card  was  not 
available,  only  the  AGP  versions  of  the  cards  are  compared.  The  test  platform  for  these 
experiments  was  a  3.0  GHz  P4  (HT)  with  875P  chipset  and  1GB  RAM,  running 
Windows XP Professional (SP2) and DirectX 9.0c. The factors varied were: GPU
(nVidia, ATI), image size (128², 256², 512²), GPU algorithm implementation (“Palette”
and “Sequential”), and scene update scheme (non-changing, fully-changing).
However, because the “Palette” GPU implementation does not run correctly on the AGP
platform at the 512² resolution, the subset of experiments involving the 512² image size
is analyzed separately to prevent skewing the results. For this subset, there are only two
factors: GPU and scene update scheme.

For the subset of experiments involving 128² and 256² image sizes, analysis of
variance (Table A-1a) indicates that all of the effects and interactions, except for the
interaction between algorithm and scene update scheme, are statistically significant. This
is due to the small amount of variance in the experimental results (the graphics cards
seem to be very consistent in their execution times), resulting in 90% confidence
intervals that are orders of magnitude smaller than the mean effects in most cases.
Analysis of variance for the 512² image size experiments (Table A-1b) yields similar
results. Except for a few outliers, normal quantile-quantile plots of the errors (Part 1 of
Figures A-1a and b) are reasonably linear, satisfying the analysis model constraint that
errors be normally distributed. Plots of errors versus mean response indicate no trend in
variance with respect to response, satisfying the remaining model constraint (Part 2 of
Figures A-1a and b). F-test results for 2^k r designs indicated statistically significant
results with respect to experimental error [Jai91]. Although the statistical F-test is not
discussed further in this research, it was in fact passed in all cases of interest: the ratio of
the mean-square value of any given effect to the mean-squared error is generally greater
than 1,000 (mean-squared error is on the order of 10^-6 in all experiments), which is much
greater than any F-distribution percentile for the ratio, given the relatively large degrees
of freedom of the error compared to the effects.

Though most of the effects and interactions are statistically significant, only a few
turned out to be of practical importance. For the 128² and 256² experiments, allocation of
variation (Table A-1a) indicates that over 61% of the variation is attributable to the
choice of GPU, 35% to image size, and nearly 2% to an interaction between GPU
algorithm and image size. Each of the remaining effects and interactions, including GPU
algorithm and scene update scheme, accounts for less than 1% of the total variation, and
is unimportant for practical purposes. For the 512² subset of experiments, almost 100%
of the variation is due to choice of GPU. The effects of the scene update scheme factor
and its interaction with the GPU account for much less than 1% of the total variation, and
so are unimportant. Since varying the scene update scheme made little difference in
these experiments, the examples and discussion that follow only address the
non-changing scene update scheme case. The non-changing scene update scheme carries out
no processing between calls to the GPU, resulting in execution times that more purely
reflect the actual GPU processing time.

Effect  of  GPU 

Performance in these experiments for all the image sizes was most dramatically
affected by the choice of GPU. On average, the ATI card performs 4.8 times faster than
the nVidia card when processing 128² and 256² image sizes. This can be derived from
the analysis results for these experiments shown in Table A-1a, where the mean effect of
the GPU is shown to be -0.3424. Since a multiplicative model using a log-transformation
of the data is used, this figure means the ATI card performs the experiment in 10^-0.3424 =
0.45 the time of the average GPU, given an average image size, scene update scheme and
GPU algorithm implementation. Similarly, the nVidia GPU requires 10^0.3424 = 2.2 times
the execution time of the average GPU under average conditions to accomplish the same
calculations. Since the execution time of the mean GPU is 1 / 0.45 = 2.2 times that of the
ATI card, and the nVidia execution time is 2.2 times that of the mean GPU, the nVidia
execution time is therefore 2.2² = 4.8 times that of the ATI card, on average.
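
The arithmetic in this derivation can be reproduced directly from the reported mean effect of -0.3424. The sketch below assumes, as the paragraph states, that the effect is expressed in log10 units and that the two GPU levels have equal and opposite effects; the function name `level_factor` is hypothetical.

```python
def level_factor(effect_log10):
    """Convert a log10 effect into a multiplicative factor on execution time."""
    return 10.0 ** effect_log10

MEAN_EFFECT_GPU = -0.3424  # mean effect of the GPU factor reported for Table A-1a

ati_factor = level_factor(MEAN_EFFECT_GPU)       # ATI time relative to the average GPU
nvidia_factor = level_factor(-MEAN_EFFECT_GPU)   # nVidia time relative to the average GPU
nvidia_over_ati = nvidia_factor / ati_factor     # equals 10 ** (2 * 0.3424)

print(round(ati_factor, 2), round(nvidia_factor, 1), round(nvidia_over_ati, 1))
```

The ratio of the two factors gives the average speedup of the ATI card over the nVidia card under the multiplicative model.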

A similar approach can be used to compare the two graphics cards for the 512² image
size case. Per Table A-2a, the mean effect with respect to choice of GPU indicates the
ATI card provides a full 5.0 times speedup over the nVidia card for the JMASS optics
calculations at the 512² resolution.

One may intuitively validate the above GPU comparisons by simply using the
(non-transformed) mean execution times to do a case-by-case comparison of the GPUs. Table
5-1 shows the mean execution times for the Phase One experiments, and Figure 5-1
shows the same information graphically. Dividing the nVidia execution time by the ATI
execution time for a given image size and algorithm implementation yields speedup (ATI
over nVidia) in the range of 3.84 to 6.98. Speedup figures for all applicable combinations
of image size and GPU algorithm appear in Table 5-2.
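
The case-by-case comparison amounts to dividing each nVidia time by the matching ATI time. The sketch below does this with the mean execution times reported in Table 5-1 for the non-changing scene update scheme; the dictionary layout and variable names are illustrative choices, not part of the original analysis.

```python
# Mean execution times in seconds from Table 5-1, keyed by (image width, algorithm).
ati_seconds = {
    (128, "Palette"): 0.640, (128, "Sequential"): 0.870,
    (256, "Palette"): 2.059, (256, "Sequential"): 2.199,
    (512, "Sequential"): 7.226,
}
nvidia_seconds = {
    (128, "Palette"): 2.456, (128, "Sequential"): 4.216,
    (256, "Palette"): 14.377, (256, "Sequential"): 10.124,
    (512, "Sequential"): 37.427,
}

# Speedup of ATI over nVidia for each configuration both cards could run.
speedup = {case: round(nvidia_seconds[case] / ati_seconds[case], 2)
           for case in ati_seconds}

for case, value in sorted(speedup.items()):
    print(case, value)
```

The smallest and largest ratios correspond to the 3.84 to 6.98 range quoted above.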

In these experiments, the ATI card was consistently and significantly faster than the
nVidia card. Such a performance disparity between these particular cards was noted by
[BFH04] and is most likely attributable to the fact that the nVidia card carries out floating
point operations at full IEEE-754 floating point precision, while the ATI card does not.
Though the ATI card supports the IEEE-754 format, it only implements 24 of the
required 32 bits per float (16 bits mantissa, 7 for exponent). ATI therefore likely trades
speed for precision, and this is reflected in the accuracy of computed results. When
the JMASS algorithm is tested on the ATI GPU, a result (of an element-by-element
multiplication of two images, then summation) on the order of 10 can fall short of the
correct answer by as much as 0.016% due to floating point truncation. The nVidia card is
more accurate, yielding error about one-twentieth that of ATI. The impact of this error
on JMASS simulations is discussed later in this chapter.

Table 5-1. GPU execution time, in seconds, for performing 1,000 iterations of the JMASS spin scan optics
processing calculations, or equivalently, performing a dot product on 40,000 pairs of vectors whose
dimension is indicated in the Image Size column. Times shown are for all applicable combinations of
GPU, image size, and GPU algorithm, using the non-changing scene update scheme.
Accompanying figure shows the same information graphically. The ATI card is faster (up to 7x) than the
nVidia card in all cases. For the ATI card, the “Palette” approach provides slightly improved times over
the “Sequential” approach at 128² and 256² image sizes. For the nVidia card, the “Palette” approach was
best for the 128² size, and the “Sequential” approach was best for the 256² image size. For the 512² image
size, only the “Sequential” approach is used because the “Palette” approach does not work correctly on the
AGP platform.

GPU Execution Time (seconds), Spin Scan Procedure

              ATI                       nVidia
Image Size    Palette    Sequential     Palette    Sequential
128²          0.640      0.870          2.456      4.216
256²          2.059      2.199          14.377     10.124
512²          NA         7.226          NA         37.427

Figure 5-1. Graphical depiction of the data in Table 5-1, comparing nVidia and ATI GPU execution times
for the three image sizes, using Palette and Sequential GPU algorithm implementations. The Palette
approach provides slightly better performance over the Sequential approach for the ATI card at the 128²
and 256² image sizes.

Table 5-2. Comparison of ATI and nVidia graphics cards showing relative speedup provided by ATI over
nVidia for executing the JMASS spin scan image processing calculations. ATI is faster than nVidia in all
cases.

ATI Speedup over nVidia

              GPU algorithm
Image Size    Palette    Sequential
128²          3.84       4.85
256²          6.98       4.60
512²          NA         5.18

Effect of image size

For the subset of experiments involving the 128² and 256² image sizes, the 256² case
took, on average, 3.3 times as long to execute as the 128² case. Note that although
the workload increases by a factor of four when moving from the 128² to the 256² image
size, the execution time increases by a lesser factor, indicating the GPU performs better
when the larger image size is used, on average.

Effect of GPU algorithm implementation

For the subset of experiments involving the 128² and 256² image sizes, the GPU
algorithm interacts with the image size factor to contribute about 2% of the total
variation. Though not very significant in terms of overall performance, the mean effects
for this interaction (Table A-1a) indicate that, on average, the “Palette” approach
performs better with 128² images, and the “Sequential” approach performs better with
256² images. For the ATI card specifically, the “Palette” approach performs slightly
better than the “Sequential” approach for both the 128² and 256² image sizes. For the
nVidia card, results are mixed, with the “Palette” approach being best for the 128² image
size, providing a 1.7x speedup over the “Sequential” approach, and the “Sequential”
approach being better for the 256² image size, providing a 1.4x speedup over the
“Palette” approach. The bottom line is that the best choice for the algorithm depends on
which graphics card and image size one intends to use. The fact that the two graphics cards
respond differently to the two approaches is most likely attributable to their differing
internal architectures, the details of which are proprietary.

The single configuration that maximizes performance of the average case is the
“Palette” approach for the 128² images, and the “Sequential” approach for all others. The
“Palette” approach provides a 1.15x speedup over the “Sequential” approach for both
128² and 256² image sizes, while using the “Palette” approach in conjunction with the
128² image size (or using the “Sequential” approach in conjunction with the 256² image
size) provides an additional 1.32x speedup over other combinations.

Useful  work  performed 

When  comparing  the  two  GPUs,  the  concept  of  “useful  work”  supplements  this 
analysis.  Useful  work  is  a  measure  of  a  processor’s  effective  rate  for  performing  the 
floating  point  calculations  required  by  the  user,  independent  of  implementation. 

Consider that each reticle-scene multiply-add operation requires size² floating point
multiplications, plus size² - 1 additions, with size being the image width in pixels (128, 256
or 512). Thus, the amount of useful work performed in each experiment is:

\[
\text{useful work (FLOPS)} = \frac{1{,}000\ \text{iterations} \times 40\ \text{reticle-scene operations/iteration} \times (2\,\text{size}^2 - 1)\ \text{FP operations}}{\text{execution time}}
\]

This metric is specific to this application, and provides insight into the efficiency and
suitability of the processing method under consideration.
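
The useful work metric is simple to evaluate once an execution time is known. The sketch below applies the definition above (1,000 iterations, forty operations per iteration, size² multiplications plus size² - 1 additions per operation) to a few of the execution times reported in Table 5-1; the function name `useful_gflops` is hypothetical.

```python
def useful_gflops(size, execution_seconds, iterations=1000, ops_per_iteration=40):
    """Useful work in GFLOPS for square images of the given width in pixels."""
    flops_per_op = 2 * size * size - 1   # size² multiplications + (size² - 1) additions
    total_flops = iterations * ops_per_iteration * flops_per_op
    return total_flops / execution_seconds / 1e9

# Execution times (seconds) taken from Table 5-1.
ati_128_palette = useful_gflops(128, 0.640)
ati_512_sequential = useful_gflops(512, 7.226)
nvidia_256_palette = useful_gflops(256, 14.377)
nvidia_512_sequential = useful_gflops(512, 37.427)

print(round(ati_128_palette, 1), round(ati_512_sequential, 1),
      round(nvidia_256_palette, 2), round(nvidia_512_sequential, 2))
```

Because the metric counts only the floating point operations the application needs, it is independent of how a given GPU algorithm arranges the work internally.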


Table 5-3 compares the useful work performed by the two GPUs for the Phase One
experiments. The entries in the table correspond to the execution times listed in Table
5-1. In viewing the useful work figures, keep in mind the ATI card does not process at full
floating point precision, so comparing useful GFLOPS between the two cards is only
valid if one accepts ATI’s floating point limitations.

Table 5-3. Useful work, in GFLOPS, performed by the ATI and nVidia graphics cards at various image
size and GPU algorithm combinations, using the non-changing scene update scheme. Figures represent the
number of useful floating point calculations performed per second in accomplishing 1,000 iterations of the
JMASS spin scan optics processing calculations, or equivalently, performing a dot product of 40,000 pairs
of vectors whose dimension is indicated in the Image Size column.

Useful Work Performed by GPU (GFLOPS)

              ATI                       nVidia
Image Size    Palette    Sequential     Palette    Sequential
128²          2.0        1.5            0.53       0.31
256²          2.5        2.4            0.36       0.52
512²          NA         2.9            NA         0.56

The ATI card’s best times for performing the experiments were 0.640, 2.059 and
7.226 seconds for the 128², 256² and 512² image sizes, respectively, achieving rates of
useful work between 2.0 and 2.9 GFLOPS. In contrast, the nVidia card’s best times for
these experiments were 2.456, 10.124 and 37.427 seconds for the 128², 256² and 512²
image sizes, with corresponding useful work rates between 0.52 and 0.56 GFLOPS. Note
that in all but one case, for any given GPU and algorithm combination, the rate of useful
work increases with image size, which is consistent with the interpretation of
experimental results presented earlier in this chapter. Such behavior is to be expected
because larger image sizes result in a higher proportion of the total execution time being
spent in actual GPU processing, versus transferring data into and out of the graphics card.
Applying terminology from the literature, processing larger image sizes increases
computational intensity, enabling the GPU to be used more efficiently.

Phase  Two  Experiments:  GPU  Versus  CPU-based  Implementations 
The second phase of experiments compares GPU performance with that of two
alternative, CPU-based, processing methods for accomplishing the JMASS spin scan
optics calculations: a C++ software implementation, and C++ code using the
cache-optimized Intel Math Kernel Library (MKL). The factors are: processing method (GPU,
C++, MKL), bus/platform (AGP, PCI-express), scene update scheme (non-changing,
fully-changing, moving point source), and image size (128², 256², 512²). Though the
scene update scheme had little effect in the first phase of experiments, which involved
only the GPUs, the factor is retained for this phase because of its potential to affect the
CPU-based implementations. For these experiments, the ATI card is used as the
representative GPU since it proved to be consistently faster than the nVidia card. For the
GPU algorithm, the “Palette” approach was used for the 128² and 256² image sizes
because it yields slightly better performance than the “Sequential” approach does on the
ATI card. The “Sequential” approach was used for the 512² image size because it is the
only approach that works on both AGP and PCI-express platforms at the 512² resolution.
Tests are conducted on AGP and PCI-express platforms, using the AGP and PCI-express
versions of the ATI X800XT graphics card. The 3.0 GHz P4 machine from the first
phase of experiments serves as the platform for the AGP experiments, and a 3.6 GHz P4
(HT) with 925X chipset and 4GB RAM, running Windows XP Professional (SP2) and
DirectX 9.0c is used for the PCI-express experiments. Though the two machines do not
exactly facilitate an “apples-to-apples” comparison, it is shown that the differences
between these two platforms had little effect on the execution times of these experiments.

Consistent  with  the  previous  phase  of  experiments,  experimental  error  was  very  small, 
yielding  90%  confidence  intervals  orders  of  magnitude  smaller  than  the  mean  effects  in 
most  cases  (see  Table  A-2a).  The  effects  of  the  factors  and  their  interactions  are 
statistically  significant,  but  only  a  few  are  of  practical  importance.  Processing  method 
(GPU,  C++  or  MKL)  accounts  for  about  9%  of  the  total  variation;  image  size  accounts 
for  89%;  and  the  interaction  between  image  size  and  processing  method  accounts  for 
1.2%.  All  other  factors  and  interactions  contribute  less  than  1%  of  the  total  variation,  so 
may  be  considered  unimportant  for  practical  purposes.  Except  for  a  few  outliers,  normal 
quantile-quantile  plots  of  the  errors  (Figure  A-2a,  Part  1)  are  reasonably  linear,  satisfying 
the  analysis  model  constraint  that  errors  be  normally  distributed.  Plots  of  errors  versus 
mean  response  indicate  no  trend  in  errors  with  respect  to  response,  satisfying  the 
remaining  model  constraint  (Part  2  of  Figure  A-2a). 

Effect  of  scene  update  scheme 

As in the first phase of experiments, scene update scheme is not a significant factor.
Perhaps this is to be expected since the Pentium 4 level two cache (512K on the AGP
platform, 1MB on the PCI-express machine) can easily hold one or more 128² or 256²
images, and perhaps one 512² image (PCI-express machine only), allowing the fully-changing
scene update scheme to occur very quickly. From examining the execution times (Table
A-2a), a fully-changing scene update scheme does little more than add about 3-4% more
time to each experiment, and does not appear to affect any method, including the
cache-optimized MKL implementation, any more than the others. The moving point source
update scheme, which changes two scene pixels per update, produced almost identical
results to the non-changing update method. Because scene update scheme has little effect
on performance, the analyses and examples that follow only address the non-changing
case.
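
The cache argument above can be checked with simple arithmetic. The following sketch assumes each pixel is stored as a 32-bit float (4 bytes), which the text does not state explicitly, and ignores everything else competing for the cache; the names L2_CACHE_BYTES, image_bytes and images_that_fit are hypothetical.

```python
BYTES_PER_PIXEL = 4  # assumption: single-precision (32-bit) pixel values

# Level two cache sizes quoted in the text for the two test machines.
L2_CACHE_BYTES = {
    "AGP platform": 512 * 1024,
    "PCI-express platform": 1024 * 1024,
}


def image_bytes(side):
    """Memory footprint of one square image with the given side length."""
    return side * side * BYTES_PER_PIXEL


def images_that_fit(cache_bytes, side):
    """Whole images of the given size that fit in a cache of cache_bytes."""
    return cache_bytes // image_bytes(side)


for platform, cache in L2_CACHE_BYTES.items():
    for side in (128, 256, 512):
        count = images_that_fit(cache, side)
        print(f"{platform}: {count} image(s) of {side}x{side} fit in L2")
```

Under these assumptions a 512² image occupies exactly 1 MB, so it fits only in the larger cache, and only if nothing else is resident.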

Effect  of  processing  method 

In all cases, the GPU implementation ran faster than the MKL and C++
implementations, providing 1.7x and 2.5x speedup over the two, respectively, on average.
These figures come from interpreting the mean effects computed in Table A-2a in the
same fashion as was done in the previous phase. For the smallest image size tested
(128²), GPU speedup is less than this average, providing only 1.2x speedup over MKL,
and about 2x speedup over C++. This is the closest the CPU-based approaches come to
matching the speed of the GPU. As image size is increased the gap widens and the GPU
provides an increasing performance advantage over the CPU-based approaches. This is
shown in Table 5-4, which lists the relative speedup provided by the GPU on the two
platforms at the three image sizes. Note the GPU generally has less of an advantage on
the PCI-express machine because the CPU-based approaches run their fastest on this
machine, almost certainly due to its faster CPU. The speedup figures are computed by
dividing the execution time of the CPU-based method by that of the GPU method for a
given image size and platform. The complete list of execution times for this phase of
experiments is listed in Table A-2a. For convenience, the execution times also appear in
Table 5-5, and are shown graphically in Figure 5-2.

Table 5-4. Speedup provided by GPU over CPU-based methods.

              AGP platform      PCI-express platform
Image Size    C++     MKL       C++     MKL
128²          2.0     1.4       2.1     1.2
256²          2.5     1.8       2.4     1.3
512²          3.5     2.8       3.0     2.7

Table 5-5. Execution times, in seconds, compared for the GPU and CPU-based processing methods, at the
three image sizes, and on both AGP and PCI-express machines. Times are for completing 1,000 iterations
of the JMASS spin scan image processing calculations. Figure 5-2 shows the same information
graphically.

Comparison of GPU and CPU-based Approaches
Execution Times (seconds), Spin Scan Procedure

              AGP platform                  PCI-express platform
Image Size    GPU      C++       MKL        GPU      C++       MKL
128²          0.640    1.267     0.876      0.576    1.199     0.664
256²          2.059    5.234     3.776      1.980    4.787     2.645
512²          7.226    25.032    20.448     7.283    21.885    19.409

Figure 5-2. Comparison of GPU and CPU-based processing methods in executing the JMASS spin scan
procedure on the (a) AGP platform, and (b) PCI-express platform.
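
As a cross-check, the speedup figures of Table 5-4 can be recomputed from the execution times of Table 5-5 using the rule stated in the text (CPU-based time divided by GPU time). The sketch below only restates that arithmetic; the names TIMES and gpu_speedup are hypothetical.

```python
# Execution times in seconds for 1,000 spin scan iterations (Table 5-5).
TIMES = {
    "AGP": {
        128: {"GPU": 0.640, "C++": 1.267, "MKL": 0.876},
        256: {"GPU": 2.059, "C++": 5.234, "MKL": 3.776},
        512: {"GPU": 7.226, "C++": 25.032, "MKL": 20.448},
    },
    "PCI-express": {
        128: {"GPU": 0.576, "C++": 1.199, "MKL": 0.664},
        256: {"GPU": 1.980, "C++": 4.787, "MKL": 2.645},
        512: {"GPU": 7.283, "C++": 21.885, "MKL": 19.409},
    },
}


def gpu_speedup(platform, side, method):
    """Speedup of the GPU over a CPU-based method: CPU time / GPU time."""
    row = TIMES[platform][side]
    return row[method] / row["GPU"]


for platform, sizes in TIMES.items():
    for side in sizes:
        cpp = gpu_speedup(platform, side, "C++")
        mkl = gpu_speedup(platform, side, "MKL")
        print(f"{platform:12s} {side}x{side}: C++ {cpp:.1f}x, MKL {mkl:.1f}x")
```

Rounded to one decimal place, the ratios agree with the entries of Table 5-4.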

Effect  of  platform 

In this phase of experiments, the effect of graphics bus (AGP versus PCI-express) is
confounded with the difference in CPU speeds and cache sizes of the two machines used.
However, the analysis shown in Table A-2a indicates that the bus/platform factor
accounts for barely 0.3% of the total variation, making the choice of platform almost
irrelevant in these experiments. The mean effect for the platform/bus factor indicates that
the PCI-express platform, with its faster graphics bus, CPU, and larger cache, provided a
1.14x speedup over the AGP platform, on average. Though the PCI-express bus provides
a data path between the CPU main memory and GPU that is two times faster than AGP,
only relatively small improvements in GPU execution time are observed on the
PCI-express machine: 1.11x, 1.03x, 0.99x at the 128², 256² and 512² image sizes
respectively. For the 512² image size, the AGP card yielded better performance than the
PCI-express version, but only by a fraction of a percent. Note that the difference in GPU
performance between AGP and PCI-express platforms diminishes as image size is
increased. This is further evidence that larger image sizes allow the GPU to operate at
higher levels of computational intensity, thereby reducing platform-specific impacts on
the GPU processing time. At the 512² image size, the AGP and PCI-express graphics
cards perform almost identically, despite the difference in CPU speed between the two
platforms. This demonstrates that the GPU acts as an equalizer, allowing machines with
slower CPUs to perform as fast as (or faster than) machines with faster CPUs.

Interaction between processing method and image size

This effect accounts for little (about 1.2%) of the total variation, but illustrates the fact
that the GPU is best used for the larger image sizes, compared to the CPU-based
alternatives. From Table A-2a the effects of various combinations of method and image
size result in slight penalties for using the GPU at the smaller image sizes, compared to
the other methods, and slight gains for using the GPU at the 512² image size.

Effect  of  image  size 

The  image  size  factor  accounts  for  the  greatest  amount  of  variation  (89%)  in  this 
phase  of  experiments.  Unfortunately,  this  information  is  not  particularly  useful,  since  it 
is  known  that  successive  increases  in  image  size  represent  fourfold  increases  in  workload, 
and  execution  times  vary  widely  with  changing  image  size  in  these  experiments.  Since 
image  size  seems  to  overshadow  the  other  factors  in  this  set  of  experiments,  some  subsets 
of  the  experiments  are  analyzed  to  discover  any  trends  that  might  otherwise  have 
remained  hidden. 

The first subset to be analyzed considers only those experiments involving the
non-changing scene update scheme. The analysis appears in Table A-2b, and is almost
identical to that of the larger set of experiments, further confirming that the various scene
update schemes have little effect on execution time.

If the above subset is further broken down, and a separate analysis performed for each
image size case (Table A-2c), a trend with respect to the bus/platform factor appears. At
the 128² image size, bus/platform accounts for about 6% of the total variation. As image
size increases, this figure drops: to 4% at 256², and to less than 1% at the 512² image
size. This would seem to indicate that as image size increases, the bus/platform factor
becomes less significant, on average. This may make sense for the GPU case, but does
not make sense for the C++ and MKL cases, whose performance is completely dictated
by platform. A better interpretation of this trend arises if one considers that, for the 512²
image size, the mean is most influenced by the MKL and C++ methods, whose execution
times are both on the order of 20 seconds, compared to the GPU, whose execution time is
on the order of seven seconds. In this case, the GPU execution time represents the
greatest deviation from the mean, and so should be expected to dominate the analysis.
Since GPU performance depends least on the bus/platform, it makes sense that the
bus/platform factor would have less impact as image size increases and the GPU becomes
more dominant.

For these subsets of experiments, particularly those involving the 128² and 256² image
sizes, the interaction between bus/platform and processing method accounts for a greater
share (2-3%) of the total variation than previously observed. This effect provides the
greatest performance reward when MKL is combined with the faster platform, about a
1.1x speedup (best case) over the “average” combination of platform and processing
method. This is simply because MKL methods run faster on the faster CPU, while the
GPU performs almost the same, regardless of platform.

Phase  Three  Experiments:  Conical  Scan 

This  phase  of  experiments  compares  the  performance  of  the  GPU  against  CPU-based 
approaches  for  performing  the  conical  scan  variation  of  the  JMASS  image  processing 
calculations  on  both  AGP  and  PCI-express  platforms.  As  in  the  previous  phases,  each 
experiment  consists  of  running  the  algorithm  1,000  times  and  measuring  the  total 
execution time. Each iteration results in 40 reticle-scene multiply-add operations, for a
total of 40,000 per experiment. Conical scan, however, requires more input parameters:
a list of 40 reticle indices and shift offsets. These were chosen at random prior to each
experiment using the C++ rand function. A different seed was used for each
experiment, drawn from a uniform distribution between zero and 4,294,967,295. The
seeds were generated using the Matlab rand command. Conical scan also requires that
the scene image be larger than the reticle image. For these experiments, the scene image
dimensions were twice those of the reticle image, so the scene contained four times as
many pixels as the reticle image. Taking a cue from the results of the previous
phases,  only  the  non-changing  scene  update  scheme  was  used,  and  separate  sets  of 
experiments  were  conducted  for  each  image  size. 

For  this  phase  of  experiments,  there  are  two  factors:  processing  method  (GPU,  MKL, 
C++)  and  bus/platform  (AGP,  PCI-express).  Per  the  analysis  shown  in  Table  A-3,  these 
two  factors,  and  their  interaction,  are  statistically  significant  for  all  image  sizes.  Except 
for  a  few  outliers,  normal  quantile-quantile  plots  of  the  errors  (Figure  A-3,  Part  1)  are 
reasonably  linear,  satisfying  the  analysis  model  constraint  that  errors  be  normally 
distributed.  Plots  of  errors  versus  mean  response  indicate  no  trend  in  error  with  respect  to 
response,  satisfying  the  remaining  model  constraint  (Figure  A-3,  Part  2). 

Allocation of variation and GPU speedup figures for the three sets of experiments in
this phase are summarized in Table 5-6. Note that for the experiments involving the 128²
and 256² reticle image sizes, the bus/platform factor accounts for a much larger
percentage of the total variation than observed in previous phases of experiments. This is
explained by the fact that, although the GPU remains faster on average than the
CPU-based approaches, it does so by a smaller margin than was observed in the spin scan
experiments (in fact, MKL on the PCI-express bus is faster than the GPU at the 128²
image size). Since the GPU times are closer to those of the CPU-based methods, the
effect of the platform is more apparent in these experiments.

Recall from Chapter III that the GPU takes longer to execute the conical scan procedure
for several reasons. First, shifting the images requires that a less efficient method be used
for storing textures in GPU memory. Second, conical scan requires random access to the
reticle images, forcing the use of the “Sequential” algorithm implementation, which, on
the ATI card, is slower for the 128² and 256² image sizes. Third, extra time is needed in
the vertex shader to add offsets to texture coordinates. Lastly, since the scene dimensions
are twice those of the reticle image in these experiments, four times more scene data has
to be uploaded to the GPU per iteration than with spin scan. With all that extra data
being uploaded to the GPU, one might expect to observe improved GPU performance
with the PCI-express bus. However, this is not the case. From Table 5-6, under
“Speedup of PCI GPU vs. AGP GPU”, it can be seen that the PCI-express bus provides little
more speedup for the GPU than it did in previous phases of experiments.

Table 5-6. Summary of GPU performance versus that of the CPU-based methods for the conical scan
procedure. Allocation of variation for the effects of method and platform/bus are shown to indicate the
relative importance of each factor as image size is increased.

               Average GPU      Speedup of      Speedup of     Allocation of Variation (%)
               Speedup Over     PCI Platform    PCI GPU                            Interaction of
Reticle Size   C++     MKL      over AGP        vs. AGP GPU    Method   Platform   Platform & Method
128²           1.3x    0.9x     1.3x            1.13x          34       53         13
256²           2.3x    1.9x     1.3x            1.05x          86       11         3
512²           2.5x    2.2x     1.1x            1.02x          99       1          0

The disadvantages described above seem to apply most to the two smaller image sizes.
However, consistent with previous experiments, the relative speedup provided by the
GPU increases as image size is increased, such that at the 512² image size, the GPU
provides a sizeable 2.2x speedup over MKL, and 2.5x speedup over basic C++ for the
conical scan procedure.

Table 5-7. Execution times, in seconds, compared for the GPU and CPU-based Processing Methods, at the
three image sizes, and on both AGP and PCI-express machines. Times are for completing 1,000 iterations
of the JMASS conical scan image processing calculations. For these experiments, the scene contains four
times as many pixels as the reticle image. Accompanying figures show the same information
graphically.

Execution Times (seconds), Conical Scan Procedure

               AGP platform                   PCI-express platform
Reticle Size   GPU       C++       MKL        GPU       C++       MKL
128²           1.138     1.505     1.414      1.012     1.221     0.942
256²           3.037     7.971     6.688      2.889     5.796     4.650
512²           10.639    27.759    24.235     10.430    24.920    22.256

Figure 5-3. Conical scan execution times compared for the GPU and CPU-based Processing Methods, at
the three image sizes, and on both (a) AGP, and (b) PCI-express platforms. Plots show same information
contained in Table 5-7 above.

Generally,  all  the  methods  were  slower  with  conical  scan  than  they  were  with  spin 
scan.  Compare  the  execution  times  for  these  experiments,  shown  in  Table  5-7,  with  those 
of  the  Phase  Two  experiments.  For  the  C++  approach,  extra  calculations  are  needed  per 
pixel  to  index  into  the  subset  of  the  scene  array  overlapped  by  the  reticle.  MKL  does  not 
appear  to  provide  an  efficient  means  for  performing  the  required  operations  on  subsets  of 
matrices,  so  instead  of  performing  a  single  MKL  sdot  operation  on  the  two  images,  the 
MKL  routine  computes  starting  indices  for  each  row  accessed  in  the  scene  array, 
performs  a  dot  product  on  each  row  of  the  reticle  and  scene  subset,  and  accumulates  the 
results.  This  approach  was  still  faster  than  basic  C++,  but  by  a  smaller  margin  than 
observed  in  previous  experiments. 
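
The row-by-row strategy described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: plain Python stands in for the MKL sdot call, images are assumed to be flat row-major lists, and the names row_dot and conical_multiply_add as well as the (dx, dy) offset convention are hypothetical.

```python
def row_dot(a, b):
    """Dot product of two equal-length rows (stand-in for one sdot call)."""
    return sum(x * y for x, y in zip(a, b))


def conical_multiply_add(reticle, reticle_width, scene, scene_width, dx, dy):
    """Multiply-add of a reticle against the scene window at offset (dx, dy).

    Both images are flat, row-major lists. For each reticle row, the starting
    index of the overlapped scene row is computed, a row dot product is taken,
    and the partial results are accumulated.
    """
    reticle_height = len(reticle) // reticle_width
    scene_height = len(scene) // scene_width
    if dx < 0 or dy < 0:
        raise ValueError("offsets must be non-negative")
    if dx + reticle_width > scene_width or dy + reticle_height > scene_height:
        raise ValueError("reticle window falls outside the scene")

    total = 0.0
    for r in range(reticle_height):
        start = (r + dy) * scene_width + dx          # first overlapped scene pixel
        scene_row = scene[start:start + reticle_width]
        reticle_row = reticle[r * reticle_width:(r + 1) * reticle_width]
        total += row_dot(reticle_row, scene_row)
    return total


# Tiny example: a 2x2 reticle of ones over a 4x4 scene holding 0..15.
reticle = [1.0] * 4
scene = [float(v) for v in range(16)]
print(conical_multiply_add(reticle, 2, scene, 4, dx=1, dy=1))
```

The per-row index computation is the extra work that a single dot product over two equally sized images avoids, which is consistent with the smaller margin reported for MKL in this phase.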

For the 128² and 256² image sizes, the interaction between method and bus/platform
accounts for about 13% and 3% of the total variation respectively, diminishing to below
1% for the 512² case. As in the Phase Two experiments, the effects of the various
combinations of method and platform are explained by the fact that the GPU provides a
higher margin of performance gain over the CPU-based methods when combined with
the slower platform, and the opposite generally holds true when the CPU-based methods
are combined with the faster processor. The impact of this interaction diminishes with
increasing image size because total variation becomes dominated by the GPU, and GPU
performance is little-affected by platform.

Phase Four Experiments: GPU Performance With JMASS


This  phase  of  experiments  compares  the  performance  of  baseline  JMASS  to  that  of 
GPU-assisted  JMASS  in  running  an  actual  JMASS  simulation,  specifically  the  JMASS 
generic  Man-Portable  Air  Defense  System  (MANPADS)  threat  model,  set  for  a  10 
second  engagement,  at  the  three  image  sizes.  Only  the  spin  scan  case  was  tested  because 
integrating  the  GPU  code  for  conical  scan  required  extensive  modifications  to  JMASS. 

Two  versions  of  JMASS  were  used,  baseline  JMASS  and  modified  JMASS.  Baseline 
JMASS  is  the  version  currently  used  by  AFIWC,  and  the  target  of  this,  and  other, 
hardware  acceleration  efforts.  During  each  simulation  time  step,  it  performs  costly  image 
processing  computations  in  software  to  simulate  the  missile’s  optical  path:  reticle  image 
rotation  and  interpolation,  and  a  reticle-scene  multiply-add  operation.  Modified  JMASS 
improves  upon  baseline  JMASS  by  switching  to  a  lookup-based  approach  for  the  reticle 
images,  effectively  eliminating  thousands  of  repetitive  rotation  and  interpolation 
operations,  leaving  only  the  reticle-scene  multiply-add  operation  to  be  done  on  a  repeated 
basis  during  the  simulation.  As  a  result  of  this  research,  it  was  mutually  agreed  upon 
with  AFIWC  that  they  should  transition  JMASS  to  this  lookup-based  approach,  not  only 
to  support  integration  of  GPU  processing,  but  because  it  could  improve  JMASS 
performance  even  if  GPU  acceleration  were  not  used.  Hence,  modified  JMASS  can  run 
in  either  “GPU-assisted”  or  “Software”  modes.  The  GPU-assisted  version  performs  the 
reticle-scene  multiply-add  operation  in  GPU  hardware,  using  the  same  GPU  code 
implementation  that  was  used  in  the  Phase  Two  experiments.  The  Software  version 
accomplishes  the  calculations  in  software,  analogous  to  the  “C++”  processing  method 
used in the Phase Two experiments. GPU-assisted JMASS was tested with both ATI and
nVidia graphics cards. The platform was a 2.8 GHz Pentium 4 (HT) with 512MB RAM,
running Windows XP Professional (SP1) and DirectX 9.0b. Only one replication of each
experiment was conducted.


The results for these experiments appear in Table 5-8. From the table it can be seen
that Modified JMASS, both the Software and GPU-assisted versions, outperform baseline
JMASS in every case. From the analysis at Table A-4, the Software version provides
about 1.4x speedup over baseline JMASS, and the GPU-assisted version provides about
1.5x speedup over baseline JMASS, on average.


Table 5-8. Execution times, in seconds, compared for original JMASS, modified JMASS and
GPU-assisted JMASS, at the three image sizes. Each experiment consisted of running the JMASS generic
MANPADS threat model, set for a 10 second engagement.

                           Modified JMASS
              Baseline     Multiply-add      GPU-assisted
Image Size    JMASS        In Software       ATI       nVidia
128²          579          407               360       359
256²          2141         1574              1393      1411
512²          8200         6289              5530      5525
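
The average speedups quoted for these experiments can be approximated directly from the Table 5-8 times. The sketch below simply averages per-image-size ratios, which is an assumption made here for illustration; the figures in the text come from the mean effects analysis in Table A-4, so only approximate agreement should be expected. The names JMASS_TIMES and mean_speedup are hypothetical.

```python
# JMASS execution times in seconds for a 10 second engagement (Table 5-8).
JMASS_TIMES = {
    128: {"baseline": 579, "software": 407, "ati": 360, "nvidia": 359},
    256: {"baseline": 2141, "software": 1574, "ati": 1393, "nvidia": 1411},
    512: {"baseline": 8200, "software": 6289, "ati": 5530, "nvidia": 5525},
}


def mean_speedup(slower, faster):
    """Average, over the three image sizes, of time(slower) / time(faster)."""
    ratios = [row[slower] / row[faster] for row in JMASS_TIMES.values()]
    return sum(ratios) / len(ratios)


print(f"Software over baseline: {mean_speedup('baseline', 'software'):.2f}x")
print(f"GPU (ATI) over baseline: {mean_speedup('baseline', 'ati'):.2f}x")
print(f"GPU (ATI) over Software: {mean_speedup('software', 'ati'):.2f}x")
```

Rounded to one decimal place, these averages give about 1.4x, 1.5x and 1.1x respectively.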

Modified JMASS (Software) can be viewed as the first of two incremental
improvements over baseline JMASS: it implements the more efficient lookup-based
approach described above, providing 1.4x speedup over baseline JMASS, on average.

The GPU-assisted version provides a second incremental improvement, enhancing the
Software version by performing the reticle-scene multiply-add operation in GPU
hardware. The GPU speeds up the Software version by about 1.1x, providing an absolute
speedup over baseline JMASS of 1.5x, on average. Viewing the successive
improvements in this manner reveals that the biggest performance gain for JMASS comes
from transitioning to the lookup-based approach, and using the GPU to further optimize
the reticle-scene calculations provides only a small additional benefit. This information
is summarized in Table 5-9.

As indicated above, the GPU does not provide much of a performance boost to
JMASS. The reason for this lies in the fact that Modified JMASS (Software version), by
going to a lookup-based approach, does away with most of the time-consuming optics
processing, namely the reticle rotation and interpolation operations. In so doing, the
optics calculations become a much smaller contributor to the total JMASS execution
time. This is shown in Table 5-9, which gives the estimated⁴ proportion of JMASS
execution time attributable to optics processing versus other activities for the three tested
JMASS versions. In baseline JMASS, optics processing accounts for about 35% of the
total execution time, whereas in Modified JMASS (Software), it only accounts for 11%.

Table 5-9. Percentages of JMASS execution time spent performing optics versus other processing for the
three versions of JMASS, and the speedup provided by these successive improvements. Optics processing
includes reticle rotation and interpolation, and the reticle-scene multiply-add operations. Modified JMASS
improves Original JMASS by essentially eliminating the rotation and interpolation operations via
preprocessing and look-up, but continues to perform the reticle-scene multiply-add operation in software.
GPU-assisted JMASS improves Modified JMASS by performing the multiply-add operation in GPU
hardware. It can be seen that Modified JMASS, in switching to a look-up based approach, makes
significant improvement to the optics processing. GPU-assisted JMASS provides further optimization to
the optics processing, virtually eliminating it as a factor in the total JMASS execution time. These figures
show that modifying JMASS to pre-process and look-up reticle images results in the largest improvement.
Using the GPU to speed up the remaining reticle-scene multiply-add operation adds a further small
improvement.

                           Baseline JMASS     Modified JMASS     Modified JMASS
                                              (Software)         (GPU-Assisted)
                           Other    Optics    Other    Optics    Other    Optics
Share of execution time    65%      35%       89%      11%       >99%     <1%
Incremental Speedup
over previous version                         1.4x               1.1x
Absolute Speedup
over Baseline JMASS                           1.4x               1.5x
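
The estimation procedure behind the Table 5-9 percentages (described in footnote 4) can be sketched for the 128² case. The sketch uses only the figures quoted in the text: the 2 second upper bound on ATI GPU time and the 128² execution times from Table 5-8. The names GPU_OPTICS_SECONDS, TOTAL_SECONDS_128 and time_split are hypothetical, and the per-size results differ somewhat from the averaged values reported in the table.

```python
GPU_OPTICS_SECONDS = 2   # upper bound quoted for the ATI GPU at the 128x128 size

# Total JMASS execution times in seconds at the 128x128 image size (Table 5-8).
TOTAL_SECONDS_128 = {"baseline": 579, "software": 407, "gpu_ati": 360}

# Time spent on everything except optics, assumed equal for all three versions.
OTHER_SECONDS = TOTAL_SECONDS_128["gpu_ati"] - GPU_OPTICS_SECONDS


def time_split(total_seconds):
    """Return (other %, optics %) of a version's total execution time."""
    other_pct = 100.0 * OTHER_SECONDS / total_seconds
    return other_pct, 100.0 - other_pct


for version, total in TOTAL_SECONDS_128.items():
    other_pct, optics_pct = time_split(total)
    print(f"{version:9s} other {other_pct:5.1f}%  optics {optics_pct:5.1f}%")
```

The key assumption is that non-optics processing takes the same time in all three versions, so the optics share of each version is whatever remains after subtracting it.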


⁴ Estimates derived using known GPU execution times and the JMASS execution times from Table 5-8.
Example: for the 128² image size, the ATI GPU takes no more than 2 seconds to process the 2,425 spin
scan iterations required during the simulation of a 10 second engagement. Subtracting 2 seconds from the
360 second GPU-assisted (ATI) JMASS execution time yields 358 seconds for all “other” processing.
Dividing this number by the execution times of the JMASS versions at the 128² image size gives the
fraction of the total time spent in this activity for each. Percentages shown in Table 5-9 are averages.
Actual percentages for each image size case vary by +/- 3 percentage points. These estimates agree with
results provided by profiler software, which indicate that the optics processing carried out in Modified
JMASS (Software) accounts for about 10% of the execution time for that JMASS version.

Using Amdahl’s famous equation, Modified JMASS (Software) provides about 3.8x
speedup for the optics processing, compared to baseline JMASS. At this point, the best
speedup attainable by further optimizing the optics processing is 1.12x, the speedup that
would be gained by eliminating the optics calculations altogether. The GPU therefore
performs admirably in these experiments, because it almost accomplishes this, with
GPU-assisted JMASS reducing the optics processing time to 1% or less of the total JMASS
execution time. Equivalently, the GPU provides speedup on the order of 10-40x
(depending on the GPU and image size used) for the optics processing compared to
Modified JMASS (Software). The end result of all the improvements is the elimination
of about 35% of the baseline JMASS execution time, which is a significant improvement.
Unfortunately, the majority of this improvement is due to the efficiency of the
lookup-based approach, and not the GPU. The reticle-scene multiply-add operation does not
account for enough of the total JMASS execution time for the GPU to make a big
difference overall.
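
The 1.12x ceiling follows from Amdahl's law in its standard form. The sketch below assumes the 11% optics share reported for Modified JMASS (Software) and uses the 10x and 40x ends of the quoted GPU speedup range as sample inputs; the function names amdahl_speedup and amdahl_limit are hypothetical.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is accelerated by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def amdahl_limit(fraction):
    """Upper bound on overall speedup: the accelerated part takes no time at all."""
    return 1.0 / (1.0 - fraction)


OPTICS_FRACTION = 0.11   # optics share of Modified JMASS (Software), Table 5-9

print(f"Limit with optics eliminated: {amdahl_limit(OPTICS_FRACTION):.2f}x")
for factor in (10, 40):
    overall = amdahl_speedup(OPTICS_FRACTION, factor)
    print(f"Optics {factor}x faster -> overall {overall:.2f}x")
```

Even a 40x faster optics stage leaves the overall gain just under the 1.12x limit, which is why the GPU adds only about 1.1x on top of the Software version.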

Somewhat puzzling in these results is the 10-40x speedup indicated for the
GPU-assisted versus non-GPU versions of Modified JMASS. Recall that the only difference
between the two versions is the method used for processing the reticle-scene multiply-add
operation: GPU hardware, or software. The GPU speedup observed in these experiments
is not in line with the results of the Phase Two experiments, in which the GPU yielded a
maximum of about 7x speedup over the software-based implementation. Using profiler
software, it was verified that the reticle-scene multiply-add operation in Modified JMASS
(Software) accounted for about 10% of the total execution time, meaning that for some
reason, it runs considerably slower than the functionally equivalent routine used in the
Phase Two experiments. One possible explanation for this difference is that JMASS
represents the scene as a C++ object, containing an assortment of attributes and methods,
versus using a simple array. AFIWC is investigating the cause of the apparent
inefficiency. If the inefficiency can be overcome, and the Modified JMASS multiply-add
routine can be made to run as fast as the one used in the Phase Two experiments, it is
expected that the already small advantage provided by the GPU-assisted version will
become even less significant, especially at the smaller two image sizes.

An  inconsistency  seems  to  exist  in  the  results  due  to  the  small  difference  between  the 
ATI  and  nVidia  cases  (see  Table  5-8).  Using  known  GPU  times  for  executing  the 
approximately  2,500  iterations  required  for  simulating  a  10-second  engagement,  the  ATI 
and  nVidia  cards  should  be  expected  to  differ  in  their  execution  times  by  approximately 
5, 30, and 70 seconds at the 128, 256 and 512 image sizes respectively. However, the
actual differences observed in JMASS execution time when using the different graphics
cards were only 1, 17 and 5 seconds for the respective image sizes.

The  simplest  explanation  for  this  disparity  between  expected  and  observed  differences 
in  execution  time  is  that  JMASS  execution  times  can  vary  from  run  to  run  enough  to 
mask  the  differences  in  GPU  performance.  In  this  case,  a  variation  of  1-2%  would  be 
enough.  However,  the  existence  of  such  variance  cannot  be  confirmed  because  only  one 
replication  of  each  experiment  was  performed. 
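
The 1-2% figure can be checked with simple arithmetic. The sketch below divides the expected ATI-versus-nVidia differences quoted above (5, 30 and 70 seconds) by the nVidia-assisted JMASS execution times listed in Table A-4 (359, 1411 and 5525 seconds). The function names are introduced here for illustration only and are not part of the thesis code.

```cpp
#include <cassert>
#include <cstdio>

// Percentage of a run's total time represented by a given time difference.
double relativeVariationPercent(double differenceSeconds, double totalSeconds) {
    assert(totalSeconds > 0.0);
    return 100.0 * differenceSeconds / totalSeconds;
}

// Prints, for each image size, the run-to-run variation that would be large
// enough to hide the expected difference between the two graphics cards.
void printMaskingThresholds() {
    const int    imageSize[3]    = {128, 256, 512};
    const double expectedDiff[3] = {5.0, 30.0, 70.0};       // seconds, from the text
    const double jmassTime[3]    = {359.0, 1411.0, 5525.0}; // seconds, Table A-4 (NV)
    for (int k = 0; k < 3; ++k) {
        std::printf("image size %d: %.2f%%\n", imageSize[k],
                    relativeVariationPercent(expectedDiff[k], jmassTime[k]));
    }
}
```

Under these inputs the thresholds come out at roughly 1.4%, 2.1% and 1.3% of total execution time, which is consistent with the 1-2% estimate.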

Another  possibility  that  was  investigated  is  whether  the  graphics  cards  behave 
differently  when  there  is  significant  time  delay  between  calls  to  the  GPU.  In  the  first 
three  phases  of  experiments,  the  GPU  was  tested  by  calling  its  processing  algorithm 
1,000  times,  back  to  back,  with  almost  no  delay  between  calls.  However,  with  JMASS, 
there can be more than two seconds between calls to the GPU. To see if this was a factor,
some  experiments  were  run  with  similar  delays  inserted  between  calls  to  the  GPU.  After 
running  the  experiment  using  different  image  sizes  and  delay  times  ranging  from  0.1  to 
about  two  seconds,  no  differences  were  observed  in  GPU  execution  time.  However,  large 
fluctuations,  sometimes  over  10%,  were  observed  in  the  delay  times  themselves,  even 
when  the  GPU  was  completely  removed  from  the  experiment.  Further  investigation 
revealed  that  the  variance  in  execution  time  of  the  delay  loop  generally  increased  when 
the  size  of  the  dummy  array  was  increased,  and  most  dramatically  when  it  was  increased 
so  as  to  exceed  the  capacity  of  the  CPU’s  level  two  cache.  Though  by  no  means 
conclusive,  such  variation  in  the  execution  of  a  simple  loop  makes  it  conceivable  that 
similar  variation  could  exist  in  the  execution  of  a  large  and  complex  program  like 
JMASS. 
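
The delay loop used in that investigation is not listed in this section, so the following is only a sketch of one plausible form: a loop that touches every element of a dummy array, timed repeatedly, with the spread of the timings reported as a coefficient of variation. The names `timeDelayLoopSeconds` and `delayLoopVariation`, and the use of `std::chrono`, are assumptions made for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Times one pass of a simple delay loop that touches every element of a dummy array.
double timeDelayLoopSeconds(std::vector<float>& dummy) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t k = 0; k < dummy.size(); ++k) {
        dummy[k] += 1.0f;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Coefficient of variation (standard deviation / mean) of repeated delay-loop timings.
double delayLoopVariation(std::size_t arrayElements, int repetitions) {
    assert(arrayElements > 0 && repetitions > 1);
    std::vector<float> dummy(arrayElements, 0.0f);
    std::vector<double> t(static_cast<std::size_t>(repetitions));
    double sum = 0.0;
    for (std::size_t r = 0; r < t.size(); ++r) {
        t[r] = timeDelayLoopSeconds(dummy);
        sum += t[r];
    }
    const double mean = sum / static_cast<double>(repetitions);
    double var = 0.0;
    for (std::size_t r = 0; r < t.size(); ++r) {
        var += (t[r] - mean) * (t[r] - mean) / static_cast<double>(repetitions - 1);
    }
    volatile float sink = dummy[0];  // keep the loop's stores observable
    (void)sink;
    return mean > 0.0 ? std::sqrt(var) / mean : 0.0;
}
```

Calling `delayLoopVariation` with array sizes below and above the CPU's level two cache capacity would be one way to repeat the comparison described above. The results depend on the machine and are not predicted here.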

One other possible explanation exists for the above-noted inconsistency, having to do
with  the  difference  in  the  floating  point  precision  of  the  two  cards.  Analysis  of  the 
JMASS  output  reveals  that  the  simulated  IR  detector  signals  produced  by  the  two 
graphics  cards  during  JMASS  simulation  differ  from  each  other,  and  from  that  produced 
by baseline JMASS. This comes as no surprise, since baseline JMASS uses
double-precision while the GPU is limited to single-precision, or a subset thereof in the ATI
case.  Per  an  AFIWC  subject  matter  expert,  it  is  possible  that  such  differences  could 
cause  the  simulated  missile  to  take  longer  to  acquire  or  reacquire  lock  on  the  target,  or  to 
lose  lock  more  often,  resulting  in  longer  simulations.  Another  example  of  the  graphics 
cards  producing  different  results  lies  in  the  “miss  distance”  displayed  by  JMASS  at  the 
end  of  the  simulation,  indicating  the  missile’s  final  proximity  to  the  target.  Given  the 
same simulation parameters, baseline JMASS produces miss distances of just over half a
meter,  nVidia  just  over  a  meter,  and  ATI  about  3  meters.  Because  nVidia  produces  miss 
distances  that  are  closer  to  those  generated  by  baseline  JMASS,  it  is  considered  more 
accurate.  It  is  a  possible  concern  that  ATI’s  considerably  larger  miss  distance  could  lead 
to  falsely  predicting  a  miss  when  a  more  accurate  simulation  would  predict  a  hit.  At  this 
point,  however,  it  is  only  known  that  these  differences  exist.  The  impact,  if  any,  such 
differences  might  have  on  the  outcome  and  validity  of  JMASS  simulations  remains  to  be 
established. 
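
As a generic illustration of how precision alone can change a multiply-add result (not a reproduction of the JMASS or GPU computation), the sketch below accumulates the same element-wise products once in a single-precision accumulator and once in a double-precision accumulator. The function names and the constant-valued test images are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Multiply-add reduction with a single-precision accumulator.
float multiplyAddFloat(const std::vector<float>& reticle, const std::vector<float>& scene) {
    assert(reticle.size() == scene.size());
    float acc = 0.0f;
    for (std::size_t k = 0; k < reticle.size(); ++k) {
        acc += reticle[k] * scene[k];
    }
    return acc;
}

// The same reduction with a double-precision accumulator.
double multiplyAddDouble(const std::vector<float>& reticle, const std::vector<float>& scene) {
    assert(reticle.size() == scene.size());
    double acc = 0.0;
    for (std::size_t k = 0; k < reticle.size(); ++k) {
        acc += static_cast<double>(reticle[k]) * static_cast<double>(scene[k]);
    }
    return acc;
}

// Relative disagreement between the two accumulators for constant-valued images.
double precisionGap(std::size_t pixels, float reticleValue, float sceneValue) {
    const std::vector<float> reticle(pixels, reticleValue);
    const std::vector<float> scene(pixels, sceneValue);
    const double d = multiplyAddDouble(reticle, scene);
    const double f = static_cast<double>(multiplyAddFloat(reticle, scene));
    return std::fabs(f - d) / std::fabs(d);
}
```

For a 512 x 512 image of constant values (for example `precisionGap(512 * 512, 0.1f, 0.3f)`), the two accumulators disagree by a small but nonzero relative amount. Whether differences of this kind explain the miss-distance discrepancies reported above is not established by this example.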

Summary 

These  experiments  accomplished  the  research  goals  identified  in  Chapter  IV.  The  first 
phase  of  experiments  was  designed  to  compare  the  candidate  graphics  cards,  and  a  clear 
winner  emerged.  The  ATI  processor  outperformed  the  nVidia  GPU  in  all  cases, 
providing  an  average  5x  speedup  over  its  rival.  This  advantage  is  somewhat  unfair, 
however, because the ATI GPU cuts corners with respect to floating point precision,
resulting  in  faster  processing,  but  less  accurate  results.  Though  it  is  too  early  to  tell,  these 
inaccuracies  may  make  this  card  unsuitable  for  the  JMASS  application.  The  ATI  and 
nVidia  GPUs  sustained  useful  work  rates  of  up  to  2.9  and  0.56  GFLOPS  respectively  in 
these  experiments,  displaying  formidable  processing  power — especially  considering  that 
these  figures  include  the  time  spent  transferring  data  into  and  out  of  the  graphics  cards. 

The  second  phase  of  experiments  pitted  GPU  hardware  acceleration  against  software- 
based  alternatives  for  implementing  the  JMASS  spin  scan  reticle-scene  multiply-add 
operation.  Using  the  faster  ATI  graphics  card  as  the  representative  GPU,  GPU  hardware 
consistently  outperformed  C++  and  Intel  Math  Kernel  Library  software  implementations, 
providing 1.4x to 3.5x speedup, with the GPU achieving its greatest advantage when
processing  the  largest  512  image  size. 

The  third  phase  of  experiments  compared  GPU  performance  against  the  same 
software-based  alternatives  for  executing  the  conical  scan  variation  of  the  JMASS 
multiply-add  operation.  In  all  but  one  case,  the  GPU  outperformed  C++  and  Intel  Math 
Kernel  Library  implementations,  providing  0.9x  to  2.5x  speedup. 

The  results  of  these  experiments  demonstrate  that  the  GPU  can  indeed  provide 
significant  speedup  over  software-based  alternatives  for  performing  both  the  spin  scan 
and  conical  scan  variations  of  the  JMASS  reticle-scene  multiply-add  operation. 

However,  as  was  forewarned  in  Chapter  I,  even  the  most  spectacular  GPU  speedup  could 
be  expected  to  have  little  effect  on  JMASS  system  performance  if  the  multiply-add 
operation  were  not  to  account  for  a  significant  amount  of  the  total  JMASS  execution 
time.  The  fourth  phase  of  experiments,  which  integrated  GPU  processing  into  JMASS, 
revealed  exactly  that.  The  full  suite  of  optics  calculations  performed  by  baseline  JMASS 
(rotate,  interpolate  and  multiply-add)  only  accounted  for  about  35%  of  the  total  baseline 
JMASS  execution  time,  which  is  much  less  than  originally  expected.  Further,  in  order  to 
integrate  GPU  processing  into  JMASS,  JMASS  was  modified  to  use  a  lookup-based 
approach  which  eliminated  the  bulk  of  the  optics  computations.  In  this  modified  version 
of  JMASS,  only  the  reticle-scene  multiply-add  operation  remained  to  be  optimized, 
accounting  for  only  11%  of  the  execution  time.  As  described  earlier  in  this  chapter,  the 
GPU  provided  excellent  acceleration,  reducing  the  time  spent  in  the  multiply-add 
operation  so  as  to  account  for  less  than  1%  of  the  total  JMASS  execution  time,  yielding 
close to the theoretical maximum achievable acceleration of 1.1x. The bottom line, with
respect to JMASS, is that the GPU provided the best possible speedup given its frequency
of  use.  The  results  of  the  first  three  phases  of  experiments  indicate  that  the  GPU  could 
have  a  much  greater  impact,  providing  up  to  3.5x  speedup,  in  applications  where  the 
multiply-add  operation  accounts  for  the  bulk  of  the  execution  time. 
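
The 1.1x ceiling follows from Amdahl's law: if a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s), which approaches 1 / (1 - p) as s grows. A minimal sketch, with function names chosen here for illustration:

```cpp
#include <cassert>

// Amdahl's law: overall speedup when `fraction` of the run time is
// accelerated by a factor of `localSpeedup`.
double overallSpeedup(double fraction, double localSpeedup) {
    assert(fraction >= 0.0 && fraction <= 1.0 && localSpeedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / localSpeedup);
}

// Upper bound on the overall speedup as the local speedup grows without limit.
double maxOverallSpeedup(double fraction) {
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

With the multiply-add operation at 11% of execution time, `maxOverallSpeedup(0.11)` is about 1.12, and a multiply-add speedup of 11x, for instance, gives `overallSpeedup(0.11, 11.0)` of about 1.11.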

On  a  very  positive  note,  though  the  original  intent  of  transitioning  JMASS  to  the 
lookup-based  approach  was  to  enable  the  integration  of  GPU  processing,  it  resulted  in  a 
1.4x  speedup  over  baseline  JMASS.  With  the  inclusion  of  GPU  processing,  the  overall 
speedup  is  increased  to  1.5x.  This  equates  to  eliminating  40  minutes  of  a  two-hour 
simulation. 
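
These figures can be reproduced from the execution times in Table A-4. The helper names below are illustrative and do not come from the thesis code.

```cpp
#include <cassert>

// Speedup of an improved run relative to a baseline run.
double speedup(double baselineSeconds, double improvedSeconds) {
    assert(baselineSeconds > 0.0 && improvedSeconds > 0.0);
    return baselineSeconds / improvedSeconds;
}

// Minutes removed from a run of `runMinutes` by a given overall speedup.
double minutesSaved(double runMinutes, double speedupFactor) {
    assert(speedupFactor > 0.0);
    return runMinutes * (1.0 - 1.0 / speedupFactor);
}
```

For the 256 image size, `speedup(2141, 1574)` is about 1.36 for the lookup-based approach alone and `speedup(2141, 1411)` is about 1.52 with the nVidia card added. `minutesSaved(120, 1.5)` gives the 40 minutes quoted above.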

VI. Discussion


Summary  of  Findings 

This  research  demonstrates  GPU  hardware  can  support  JMASS  spin  scan  and  conical 
scan  simulations,  performing  the  reticle-scene  multiply-add  operation  up  to  3.5x  faster 
than software-based solutions including those that have been cache-optimized. The GPU
advantage  is  greatest  when  processing  larger  image  sizes,  due  to  increased  computational 
intensity,  achieving  useful  work  rates  as  high  as  2.9  GFLOPS  for  this  application.  Two 
top-of-the-line  consumer  graphics  cards,  the  ATI  X800XT  and  nVidia  6800  Ultra,  were 
tested,  and  the  ATI  card  was  five  times  faster  on  average  than  its  nVidia  counterpart  in 
executing  the  JMASS  multiply-add  operation.  However,  the  ATI  card  is  also  less 
accurate  due  to  its  reduced  floating  point  precision,  which  may  or  may  not  impact  the 
validity  of  JMASS  simulation  results. 

This  research  resulted  in  a  1.5x  speedup  for  JMASS  by  fostering  its  transition  to  a 
lookup-based  approach  for  processing  the  reticle  images  which  eliminates  hundreds  of 
thousands  of  unnecessary  image  transformation  operations.  This  speedup  is  equivalent  to 
eliminating  40  minutes  of  every  2-hour  simulation,  and  therefore  delivers  a  significant, 
immediate  benefit  to  AFIWC. 

Nevertheless, despite the speed increases afforded by the graphics cards for
performing the JMASS image processing computations, the impact of GPU acceleration on
overall JMASS performance does not reflect the speedup achieved by the GPU. This is
not due to any problem with the GPU, which executed the multiply-add operation up
to 40 times faster than the JMASS program. Rather, the multiply-add operation
accounts for just a small portion of the total JMASS execution time, so optimizing it has
a correspondingly small effect. The results of the first three phases of experiments
indicate  that  the  GPU  could  have  a  much  greater  impact,  providing  up  to  3.5x  speedup,  in 
applications  where  the  multiply-add  operation  accounts  for  the  bulk  of  the  total  execution 
time. 

Final  Observations  and  Recommendations 

Since  JMASS  only  uses  the  GPU  about  1%  of  the  time  for  the  multiply-add  operation, 
the  GPU  can  perform  other  JMASS  processing  as  well.  One  such  use  is  for  IR  scene 
generation.  Graphics  cards  excel  at  rendering  complex  and  dynamic  3D  scenes,  and  so 
will  be  faster  than  the  procedural  methods  currently  used  by  JMASS  to  generate  the  scene 
images.  Combining  scene  generation  and  multiply-add  operations  in  the  GPU  is  very 
efficient  because  the  scene  would  reside  natively  in  GPU  memory,  and  would  not  have  to 
be  uploaded  via  costly  data  transfers  to  the  GPU  after  every  scene  update. 

Efforts  to  accelerate  JMASS  more  using  hardware  should  be  focused  on  the  portions 
of  JMASS  which  have  not  been  optimized  (e.g.,  IR  scene  generation).  Further  effort  and 
expense  devoted  to  optimizing  the  JMASS  optics  calculations,  including  reticle  image 
rotation  and  interpolation,  and  reticle-scene  multiply-add  (a.k.a.  “convolution”),  is  not 
recommended  since  these  now  only  account  for  1%  of  the  JMASS  execution  time  when 
using  the  GPU  (11%  otherwise)  and  further  optimization  will  yield  no  noticeable 
performance  gain  for  JMASS  simulations. 

The  GPU  implementations  developed  in  this  research  can  be  further  optimized.  The 
“Palette” approach used for spin scan can be made more efficient (cf. Chapter III) by not
processing  unneeded  images  at  the  top  and  bottom  rows  of  the  palette.  In  hindsight  this 
inefficiency  could  be  eliminated  altogether  by  modifying  the  algorithm  to  take  advantage 
of the fact that sequencing through consecutive groups of 40 reticle images, mod 100,
returns  to  the  initial  group  every  five  iterations.  Thus,  all  needed  reticle  image  orderings 
can  be  stored  in  five  smaller  palette  textures,  each  containing  40  reticle  images  instead  of 
64.  Since  each  palette  is  used  in  its  entirety,  there  is  no  need  to  resize  the  drawing 
rectangle,  and  no  processing  of  unwanted  images.  The  smaller  palette  textures  could  also 
support  the  512  image  size  within  GPU  memory  constraints,  whereas  the  current 
algorithm  uses  a  more  complicated  and  inefficient  procedure  to  deal  with  the  memory 
limitation  for  this  image  size.  The  proposed  approach  would  therefore  support  all  three 
image  sizes  with  a  more  efficient,  common  algorithm. 
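
The five-iteration cycle can be verified by enumerating the starting index of each group. The sketch below assumes the first group starts at reticle image 0 and that each iteration advances by one full group; the function name is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Starting reticle index of each distinct group when stepping through
// `totalImages` images in consecutive groups of `groupSize`, wrapping modulo
// `totalImages`. Assumes the first group starts at index 0.
std::vector<int> distinctGroupStarts(int groupSize, int totalImages) {
    assert(groupSize > 0 && totalImages > 0);
    std::vector<int> starts;
    int start = 0;
    do {
        starts.push_back(start);
        start = (start + groupSize) % totalImages;
    } while (start != 0);
    return starts;
}
```

For groups of 40 out of 100 images this yields the starts 0, 40, 80, 20 and 60 before the sequence repeats, which corresponds to the five smaller palette textures proposed above.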

Though  designed  specifically  to  support  the  JMASS  image  processing  requirement, 
the  GPU  implementations  developed  for  this  research  could,  with  some  modifications, 
support  any  application  that  requires  an  abundance  of  image  processing  operations 
involving  shifting  and  multiplying  images,  and  reducing  the  results.  However,  the  GPU 
hardware  imposes  some  restrictions  on  expandability.  In  designing  the  GPU-based 
algorithms,  the  chief  limitations  were  GPU  memory  capacity,  maximum  supported 
texture  size,  and  the  texture  dimension  and  shape  constraints  imposed  by  shader 
programs. 

Given the 256MB memory capacity of the GPUs, 512 x 512 is the largest reticle image size
that can be supported if all 100 reticle images are to be stored in GPU memory. The next
larger (power-of-two) image size, 1024 x 1024, cannot be supported because 100 images of that
size would require 400MB of GPU memory, and few, if any, graphics cards are so
equipped.
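
The 400MB figure is consistent with storing one 32-bit floating point value per pixel. That storage format is an assumption inferred from the number itself, as are the function names in this sketch:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold `imageCount` square images of side `imageSide`,
// assuming one 32-bit (4-byte) float per pixel.
std::uint64_t reticleSetBytes(std::uint64_t imageSide, std::uint64_t imageCount) {
    assert(imageSide > 0 && imageCount > 0);
    const std::uint64_t bytesPerPixel = 4;  // assumption: single-channel 32-bit float
    return imageSide * imageSide * bytesPerPixel * imageCount;
}

// Converts a byte count to megabytes (1MB taken as 1024 x 1024 bytes).
double toMegabytes(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Under that assumption, 100 images of side 512 need 100MB, which fits in 256MB, while 100 images of side 1024 need 400MB, which does not.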

All the implementations rely on large-sized textures for storing collections of images,
such  as  those  used  for  the  reticle  palettes  and  for  storing  intermediate  results  between 
rendering  passes.  This  seems  to  be  a  GPU-efficient  approach.  However,  once  again  512 
is  the  largest  image  size  supported  if  a  texture  containing  64  images  is  desired.  nVidia 
allows very large 4096 x 4096 texture sizes, but actually creating a floating point texture of that
size  would  use  up  the  entire  256MB  of  available  memory!  Therefore,  if  future  GPUs  are 
improved  to  support  larger  textures,  GPU  memory  size  must  also  be  increased  for  it  to 
benefit  this  application. 
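
The texture limits can be worked through in the same way. The sketch assumes the 64 images are tiled in a square 8 x 8 grid with one image pixel per texel, and takes the bytes per texel as a parameter. A value of 16 bytes (four 32-bit floating point channels) is an assumption that reproduces the 256MB figure quoted above. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Side length, in texels, of a square texture that tiles
// tilesPerSide x tilesPerSide images of side `imageSide`.
std::uint64_t paletteTextureSide(std::uint64_t imageSide, std::uint64_t tilesPerSide) {
    assert(imageSide > 0 && tilesPerSide > 0);
    return imageSide * tilesPerSide;
}

// Bytes occupied by a square texture of the given side and texel size.
std::uint64_t textureBytes(std::uint64_t side, std::uint64_t bytesPerTexel) {
    assert(side > 0 && bytesPerTexel > 0);
    return side * side * bytesPerTexel;
}
```

With these assumptions, 64 images of side 512 need a texture 4096 texels on a side, and `textureBytes(4096, 16)` is 268,435,456 bytes, or 256MB.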

Per  Chapter  III,  pixel  shader  programs  impose  power-of-two  dimension  and  square 
shape  limitations  under  certain  circumstances.  These  restrictions  can  force  using  larger 
textures  than  necessary,  resulting  in  wasted  GPU  processing.  Another  limitation  with 
respect  to  pixel  shaders  is  the  limited  depth  of  dependent  texture  addressing  supported. 
Dependent  texture  addressing  allows  texture  coordinates  which  address  one  pixel  to  be 
used  to  derive  the  coordinates  for  another.  Limiting  this  practice  decreases  the  number  of 
adjacent  pixels  that  can  be  summed  or  multiplied  during  a  rendering  pass,  and  restricts 
the  creativity  of  the  programmer.  Removing  these  restrictions  could  allow  programmers 
to  create  more  efficient  algorithms. 

This  research  has  demonstrated  that  graphics  cards  can  provide  an  impressive 
performance  boost  for  a  general  computing  application,  provided  the  application  lends 
itself  to  SIMD  processing  and  can  maintain  high  enough  rates  of  computational  intensity. 
It  has  further  been  shown  that  GPU  acceleration  can  enable  slower  computers  to  meet  or 
exceed  the  performance  of  faster  and  otherwise  better-equipped  machines.  If  GPU 
technology  continues  to  improve  as  it  has  (and  given  the  current  state  of  the  PC  gaming 
industry there is no reason to expect otherwise), the limitations described above are not
likely  to  exist  for  long,  and  the  GPU  could  indeed  become  the  processor  of  choice  for 
many  applications.  In  the  meantime,  the  latest  graphics  cards,  which  support  floating 
point  operations  and  can  be  flexibly  programmed  via  rich  APIs  and  shader  programming 
languages,  are  better  prepared  than  ever  to  meet  the  demands  of  scientific,  engineering 
and  modeling  and  simulation  applications. 

Appendix A. Analysis Tables and Figures

TABLE A-1b

Results and analysis of Phase One experiments, 512 image size case.

[Plots not recovered: quantile-quantile plot, PH1b (log transform); mean response vs. residuals, PH1b (log transform).]

TABLE A-2a

Results and analysis of Phase Two experiments, comparing GPU to CPU-based approaches,
at three image sizes and three scene update schemes, on PCI-e and AGP platforms.

[Table body not recovered. Surviving headings: Computation of Interaction Effects; A BCD Effects.]

Figure A-2. Normal quantile-quantile and errors-versus-responses plots for Phase 2 (spin scan, GPU, C++, MKL) experiments.
Quantile-quantile plot is reasonably linear, except for a few outliers. Errors-versus-responses plot shows no trend. Analysis model
is therefore valid.

[Table fragment: 160.077  99.999  Total percent; SSE  9.29E-04  0.001]

TABLE A-2c

Subset of Phase Two experiments. Compares GPU versus CPU-based methods on PCI-e and AGP
platforms. Separate analysis performed for each image size case. Non-changing scene update scheme

TABLE A-3

Results and analysis of Phase Three experiments. Compares GPU versus CPU-based
methods for the JMASS conical scan procedure.

Figure A-3. Normal quantile-quantile and errors-versus-responses plots for Phase 3 (conical scan, GPU, C++, MKL)
experiments. Quantile-quantile plot is reasonably linear, except for a few outliers. Errors-versus-responses plot shows
no trend. Analysis model is therefore valid.


TABLE A-4

Analysis of Phase Four experiments, comparing baseline JMASS to Modified JMASS, Software and GPU-assisted versions.

JMASS Execution Time (seconds)

Image Size    JMASS Orig    JMASS Modified    GPU Assisted NV    GPU Assisted ATi
128                  579               407                359                 360
256                 2141              1574               1411                1393
512                 8200              6289               5525                5530

JMASS Execution Time (log transform)

[Table body largely not recovered. Surviving values: 2.763 (image size 128), 3.914, 3.799; sums of squares: SSY 126.384, SS0 123.535.]

Appendix B. GPU Implementation Code

Contents

File

winAppGPU.cpp
GPUcombined.h (spin scan implementation)
vs_bigtex.txt
ps_bigtex.txt
vs_onebyone.txt
ps_onebyone.txt
vs_16tapredux_2.txt
ps_16tapredux_2.txt
winCONSCAN.cpp
GPU_CONSCAN.h (conical scan implementation)
vs_CONSCAN.txt
ps_CONSCAN.txt
GPU_UTILITY.h (utility routines used by both implementations)

// winAppGPU.cpp
//
// by: Maj Sean Jeffers
// descr: windows test application for GPU-based algorithms
// 27 dec 04 -- modified to output both normal and log-transformed execution time data

#include "stdafx.h"
#include "winAppGPU.h"
#define MAX_LOADSTRING 100

#include <mmsystem.h>   // timeGetTime(); link with winmm.lib
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>      // exit(), itoa()
#include <cstring>      // strcpy(), strcat()
#include <cmath>        // pow(), sqrt(), log10()
#include "GPU_COMBINED.h" // combined.h or CLASS_ONEBYONE.h CLASS_ONEBYONE_R32F.h


// Global Variables:
HINSTANCE hInst;                        // current instance
TCHAR szTitle[MAX_LOADSTRING];          // The title bar text
TCHAR szWindowClass[MAX_LOADSTRING];    // the main window class name

const int   EXPER      = 100;
const int   SCENE_SIZE = 128;
const int   WL         = 1;
const char* BUS_str    = "PCI-e";
const char* APRCH_str  = "ATI";
const int   REPS       = 2;
// [one line illegible in the source listing]
const int   SIZE_SQ    = SCENE_SIZE*SCENE_SIZE;

// WL 3 pt source vars
const int xmin = SCENE_SIZE/4;
const int ymin = SCENE_SIZE/4;
const int xmax = SCENE_SIZE-xmin;
const int ymax = SCENE_SIZE-ymin;

//initial conditions
int oldx = xmin;
int oldy = ymin;
int delx = -1;
int xinc = -1;
int dely = 0;
int yinc = -1;

// Forward declarations of functions included in this code module:
void UpdateScene(int, float*);
ATOM MyRegisterClass(HINSTANCE hInstance);
BOOL InitInstance(HINSTANCE, int);
LRESULT CALLBACK WndProc(HWND, UINT, WPARAM, LPARAM);
LRESULT CALLBACK About(HWND, UINT, WPARAM, LPARAM);


int APIENTRY _tWinMain(HINSTANCE hInstance,
                       HINSTANCE hPrevInstance,
                       LPTSTR    lpCmdLine,
                       int       nCmdShow)
{
    float  reticle[SIZE_SQ];
    float  scene[SIZE_SQ];
    double answer[40];
    double times[REPS];
    double timeslog[REPS];
    double startTime;
    double endTime;
    double sum      = 0.0;
    double mean     = 0.0;
    double var      = 0.0;
    double stdev    = 0.0;
    double hi       = 0.0;
    double low      = 0.0;
    double sos      = 0.0;
    double sumlog   = 0.0;
    double meanlog  = 0.0;
    double varlog   = 0.0;
    double stdevlog = 0.0;
    double soslog   = 0.0;
    double hilog    = 0.0;
    double lowlog   = 0.0;

    // describe the workload (scene update scheme) for the results file
    char WL_str[30];
    if (WL == 1) {
        strcpy(WL_str, "1 - non-changing");
    } else if (WL == 2) {
        strcpy(WL_str, "2 - fully-changing");
    } else {
        strcpy(WL_str, "3 - moving pt source");
    }

    // instantiate GPU object
    Gpu gpu(hInstance, nCmdShow, SCENE_SIZE);

    //upload reticles
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < SIZE_SQ; j++) {
            reticle[j] = (float)i;
        }
        gpu.uploadReticle(i, reticle);
    }

    char algorithm[40] = "";   // stays empty if the algorithm id is not recognized
    int alg = gpu.GetAlg();
    if (alg == 1) {
        strcpy(algorithm, "BIGTEX");
    } else if (alg == 2) {
        strcpy(algorithm, "ONEBYONE");
    }

    // NOTE: the loop header is missing from the source listing; it is
    // reconstructed here from the use of rep and REPS below.
    for (int rep = 0; rep < REPS; rep++) {
        startTime = (double)timeGetTime();

        // run algorithm 1000x
        for (int i = 0; i < 1000; i++) {
            gpu.Process(i%100, scene, answer);
            UpdateScene(WL, scene);
        }

        endTime = (double)timeGetTime();
        double timeDelta = (endTime-startTime)*0.001f;   // milliseconds to seconds
        double timeDeltaLog = log10(timeDelta);
        times[rep] = timeDelta;
        timeslog[rep] = timeDeltaLog;
        sum += timeDelta;
        sumlog += timeDeltaLog;
    }

    //calc stats
    mean = sum / (double)REPS;
    meanlog = sumlog / (double)REPS;
    hi = 0.0;
    hilog = -1000.0;
    low = 1000.0;
    lowlog = 1000.0;
    for (int i = 0; i < REPS; i++) {
        if (times[i] > hi)
            hi = times[i];
        if (times[i] < low)
            low = times[i];
        var += pow((times[i]-mean), 2.0)/(double)(REPS-1);
        sos += pow(times[i], 2.0);
        if (timeslog[i] > hilog)
            hilog = timeslog[i];
        if (timeslog[i] < lowlog)
            lowlog = timeslog[i];
        varlog += pow((timeslog[i]-meanlog), 2.0)/(double)(REPS-1);
        soslog += pow(timeslog[i], 2.0);
    }
    stdev = sqrt(var);
    stdevlog = sqrt(varlog);

    //write results to file
    const char* name = "results/results";
    const char* ext = ".dat";
    char num[4];
    itoa(EXPER, num, 10);
    char filename[40];
    strcpy(filename, name);
    strcat(filename, num);
    strcat(filename, ext);

    std::ofstream outFile(filename, std::ios::app);   // out
    if (!outFile) {
        ::MessageBox(0, "can't open results file", "GPU", 0);
        exit(1);
    }

    outFile << "experiment #: "    << EXPER      << '\n'
            << "workload: "        << WL_str     << '\n'
            << "bus: "             << BUS_str    << '\n'
            << "approach: "        << APRCH_str  << '\n'
            << "algorithm: "       << algorithm  << '\n'
            << "size: "            << SCENE_SIZE << '\n'
            << "mean: "            << mean       << '\n'
            << "variance: "        << var        << '\n'
            << "stdev: "           << stdev      << '\n'
            << "sum of sqrs: "     << sos        << '\n'
            << "low: "             << low        << '\n'
            << "hi: "              << hi         << '\n'
            << "mean log: "        << meanlog    << '\n'
            << "variance log: "    << varlog     << '\n'
            << "stdev log: "       << stdevlog   << '\n'
            << "sum of sqrs log: " << soslog     << '\n'
            << "low log: "         << lowlog     << '\n'
            << "hi log: "          << hilog      << '\n'
            << "reps: "            << REPS       << '\n'
            << "data: "            << '\n';

    // five values per line, each line ending in "..."
    for (int i = 0; i < REPS; i++) {
        outFile << times[i];
        if (!((i+1)%5) || (i == REPS-1))
            outFile << '\t' << "..." << '\n';
        else outFile << '\t';
    }

    outFile << '\n' << "data log: " << '\n';
    for (int i = 0; i < REPS; i++) {
        outFile << std::setprecision(6) << std::setw(3) << timeslog[i];
        if (!((i+1)%5) || (i == REPS-1))
            outFile << '\t' << "..." << '\n';
        else outFile << '\t';
    }

    // four answers per line, in scientific notation
    outFile << '\n' << "answers: " << '\n';
    for (int i = 0; i < 40; i++) {
        outFile << std::setprecision(9) << std::setw(18) <<
            std::setiosflags(std::ios::scientific) << answer[i];
        if (!((i+1)%4))
            outFile << '\n';
    }
    outFile << '\n';

    // TODO: Place code here.
    MSG msg;
    HACCEL hAccelTable;

    // Initialize global strings
    LoadString(hInstance, IDS_APP_TITLE, szTitle, MAX_LOADSTRING);
    LoadString(hInstance, IDC_WINAPPGPU, szWindowClass, MAX_LOADSTRING);
    MyRegisterClass(hInstance);

    // Perform application initialization:
    if (!InitInstance(hInstance, nCmdShow))
    {
        return FALSE;
    }

    hAccelTable = LoadAccelerators(hInstance, (LPCTSTR)IDC_WINAPPGPU);

    // Main message loop:
    while (GetMessage(&msg, NULL, 0, 0))
    {
        if (!TranslateAccelerator(msg.hwnd, hAccelTable, &msg))
        {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
    }

    return (int) msg.wParam;
}


// FUNCTION: UpdateScene()

void UpdateScene(int p_WL, float* p_scene) {
    // workload 1: non-changing scene
    if (p_WL == 1) {
        return;
    }
    // workload 2: fully-changing scene
    if (p_WL == 2) {
        for (int j = 0; j < SIZE_SQ; j++) {
            p_scene[j] += 1.0f;
        }
        return;
    }
    // workload 3: moving point source
    else
    {
        int x = delx + oldx;
        int y = dely + oldy;
        if ((x < xmin) || (x > xmax)) {
            x = oldx;
            xinc = -xinc;
            delx = delx+xinc;
            dely = dely+yinc;
            y += dely;
        }
        if ((y < ymin) || (y > ymax)) {
            y = oldy;
            yinc = -yinc;
            delx += xinc;
            dely += yinc;
            x += delx;
        }
        int index = y*SCENE_SIZE + x;
        int indexold = oldy*SCENE_SIZE + oldx;
        p_scene[indexold] = 0.0f;
        p_scene[index] = 1.0f;
        oldx = x;
        oldy = y;
        return;
    }
}


//
// FUNCTION: MyRegisterClass()
//
// PURPOSE: Registers the window class.
//
// COMMENTS:
//
//    This function and its usage are only necessary if you want this code
//    to be compatible with Win32 systems prior to the 'RegisterClassEx'
//    function that was added to Windows 95. It is important to call this function
//    so that the application will get 'well formed' small icons associated
//    with it.
//
ATOM MyRegisterClass(HINSTANCE hInstance)
{
    WNDCLASSEX wcex;

    wcex.cbSize = sizeof(WNDCLASSEX);

    wcex.style         = CS_HREDRAW | CS_VREDRAW;
    wcex.lpfnWndProc   = (WNDPROC)WndProc;
    wcex.cbClsExtra    = 0;
    wcex.cbWndExtra    = 0;
    wcex.hInstance     = hInstance;
    wcex.hIcon         = LoadIcon(hInstance, (LPCTSTR)IDI_WINAPPGPU);
    wcex.hCursor       = LoadCursor(NULL, IDC_ARROW);
    wcex.hbrBackground = (HBRUSH)(COLOR_WINDOW+1);
    wcex.lpszMenuName  = (LPCTSTR)IDC_WINAPPGPU;
    wcex.lpszClassName = szWindowClass;
    wcex.hIconSm       = LoadIcon(wcex.hInstance, (LPCTSTR)IDI_SMALL);

    return RegisterClassEx(&wcex);
}

//
//  FUNCTION: InitInstance(HANDLE, int)
//
//  PURPOSE: Saves instance handle and creates main window
//
//  COMMENTS:
//
//    In this function, we save the instance handle in a global variable and
//    create and display the main program window.
//
BOOL InitInstance(HINSTANCE hInstance, int nCmdShow)
{
    HWND hWnd;

    hInst = hInstance; // Store instance handle in our global variable

    hWnd = CreateWindow(szWindowClass, szTitle, WS_OVERLAPPEDWINDOW,
        CW_USEDEFAULT, 0, CW_USEDEFAULT, 0, NULL, NULL, hInstance, NULL);

    if (!hWnd)
    {
        return FALSE;
    }

    ShowWindow(hWnd, nCmdShow);
    UpdateWindow(hWnd);

    return TRUE;
}

//
//  FUNCTION: WndProc(HWND, unsigned, WORD, LONG)
//
//  PURPOSE:  Processes messages for the main window.
//
//  WM_COMMAND - process the application menu
//  WM_PAINT   - Paint the main window
//  WM_DESTROY - post a quit message and return
//
//
LRESULT CALLBACK WndProc(HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam)
{
    int wmId, wmEvent;
    PAINTSTRUCT ps;
    HDC hdc;

    switch (message)
    {
    case WM_COMMAND:
        wmId    = LOWORD(wParam);
        wmEvent = HIWORD(wParam);
        // Parse the menu selections:
        switch (wmId)
        {
        case IDM_ABOUT:
            DialogBox(hInst, (LPCTSTR)IDD_ABOUTBOX, hWnd, (DLGPROC)About);
            break;
        case IDM_EXIT:
            DestroyWindow(hWnd);
            break;
        default:
            return DefWindowProc(hWnd, message, wParam, lParam);
        }
        break;
    case WM_PAINT:
        hdc = BeginPaint(hWnd, &ps);
        // TODO: Add any drawing code here...
        EndPaint(hWnd, &ps);
        break;
    case WM_DESTROY:
        PostQuitMessage(0);
        break;
    default:
        return DefWindowProc(hWnd, message, wParam, lParam);
    }
    return 0;
}


// Message handler for about box.
LRESULT CALLBACK About(HWND hDlg, UINT message, WPARAM wParam, LPARAM lParam)
{
    switch (message)
    {
    case WM_INITDIALOG:
        return TRUE;

    case WM_COMMAND:
        if (LOWORD(wParam) == IDOK || LOWORD(wParam) == IDCANCEL)
        {
            EndDialog(hDlg, LOWORD(wParam));
            return TRUE;
        }
        break;
    }
    return FALSE;
}
// file: GPU_combined.h
//
// by: Maj Sean Jeffers
//
// requires external files:
//   GPU_UTILITY.h              -- contains namespace d3d utility functions InitD3D(),
//                                 GPU_WndProc CALLBACK and Gpu_WndClass definition
//   source/vs_bigtex.txt       -- vertex shader used by MAddReduce() for BIGTEX
//   source/ps_bigtex.txt       -- pixel shader used by MAddReduce() for BIGTEX
//   source/ps_onebyone.txt     -- PS used for MAddReduce() for ONEBYONE
//   source/vs_onebyone.txt     -- PS used for MAddReduce() for ONEBYONE
//   source/vs_16tapredux_2.txt -- vertex shader used by Redux()
//   source/ps_16tapredux_2.txt -- pixel shader used by Redux()
//
// 27 dec -- combined BIGTEX for 128/256 and ONEBYONE for 512 size; modified both
//           BIGTEX and ONEBYONE pixel shaders to take 128-bit tex's in, but output
//           to R32F for speed; ps_experimental and ps_maddreduce_new were modified
//

#ifndef GPU_CLASS_H_BY_MAJ_JEFFERS
#define GPU_CLASS_H_BY_MAJ_JEFFERS

#include <d3dx9.h>
#include "GPU_UTILITY.h"

#include <stdlib.h>
#include <cstring>
#include <cmath>

// -------- CONSTANTS --------
#define GPU_WINDOW_WIDTH  1024
#define GPU_WINDOW_HEIGHT 768
#define D3D_FORMAT        D3DFMT_A32B32G32R32F
#define STRIDE            16

II . 


class Gpu {

private:
    HINSTANCE          hInst;
    int                nCmdShow;
    IDirect3DDevice9*  Device;
    const int          SceneSize;
    int                ScenePixels;
    float              fPixSizeX;
    float              fPixSizeY;
    // D3DXVECTOR4     DataArray[2048*2048];
    int                OutTexSize;
    long               OutPixels;
    // int             ViewportSize;
    int                ReduceIterations;
    int                TexIndex;
    bool               DualRT;
    int                start1;
    int                end1;
    int                end2;

    // VS1
    IDirect3DVertexShader9* VS1_maddreduce;
    ID3DXConstantTable*     VS1_VSCT;
    D3DXHANDLE              VS1_PixelSizeHandle;

    // PS1
    IDirect3DPixelShader9*  PS1_maddreduce;
    ID3DXConstantTable*     PS1_PSCT;
    D3DXHANDLE              PS1_PixelSizeHandle;

    // VS2
    IDirect3DVertexShader9* VS2_16tapreduce;
    ID3DXConstantTable*     VS2_VSCT;
    D3DXHANDLE              VS2_offsetHandle;

    // PS2
    IDirect3DPixelShader9*  PS2_16tapreduce;
    ID3DXConstantTable*     PS2_PSCT;
    D3DXHANDLE              PS2_offsetHandle;
    D3DXHANDLE              PS2_mulHandle;

    // PARAMETERS PASSED TO PS & VS
    D3DXVECTOR2 offset[3][16];

    // VERTEX BUFFER & DECL
    LPDIRECT3DVERTEXDECLARATION9 m_pDecl;
    IDirect3DVertexBuffer9*      QuadVB;

    // TEXTURES & SURFACES
    IDirect3DTexture9* Scene_Tex;
    IDirect3DSurface9* Scene_Surface;

    IDirect3DTexture9* Reticle_Tex[100];
    IDirect3DSurface9* Reticle_Surface[100];

    IDirect3DTexture9* RT_Tex;
    IDirect3DSurface9* RT_Surface;

    IDirect3DTexture9* Pal_Tex[4];
    IDirect3DSurface9* Pal_Surf[4];

    IDirect3DTexture9* RT_Reduce_Tex[3];
    IDirect3DSurface9* RT_Reduce_Surface[3];
    IDirect3DTexture9* RT_Reduce_Tex2[3];
    IDirect3DSurface9* RT_Reduce_Surface2[3];

    IDirect3DTexture9* Ones_Tex;
    IDirect3DSurface9* Ones_Surface;

    // transformation matrices
    D3DXMATRIX mWorld;
    D3DXMATRIX mView;
    D3DXMATRIX mProj;

    // -------- STRUCTS --------
    struct CUSTOMVERTEX
    {
        FLOAT x;
        FLOAT y;
    };
public:

    // constructor
    Gpu(HINSTANCE p_hInst, int p_nCmdShow, int p_SceneSize)
        : hInst(p_hInst), nCmdShow(p_nCmdShow), SceneSize(p_SceneSize / 2)
    {
        Device = 0;

        ScenePixels = SceneSize * SceneSize;
        fPixSizeX   = 1.0f / (float)SceneSize;
        fPixSizeY   = 1.0f / (float)SceneSize;

        // VS1
        VS1_maddreduce      = 0;
        VS1_VSCT            = 0;
        VS1_PixelSizeHandle = 0;

        // PS1
        PS1_maddreduce      = 0;
        PS1_PSCT            = 0;

        // VS2
        VS2_16tapreduce     = 0;
        VS2_VSCT            = 0;
        VS2_offsetHandle    = 0;

        // PS2
        PS2_16tapreduce     = 0;
        PS2_PSCT            = 0;
        PS2_offsetHandle    = 0;

        // vertex buffer ptr
        QuadVB              = 0;

        if (!d3d::InitD3D(hInst, nCmdShow,
                GPU_WINDOW_WIDTH, GPU_WINDOW_HEIGHT, true, D3DDEVTYPE_HAL, &Device))
        {
            ::MessageBox(0, "InitD3D() - FAILED", 0, 0);
        }

        if (!Setup()) {
            ::MessageBox(0, "Setup() - FAILED", 0, 0);
        }

        DualRT = false;

        // modular actions depending on algorithm
        InitShaders();
        InitReticlesAndScene();
        InitRenderTargets_hybrid();

    } // Gpu() CONSTRUCTOR

private:

    // InitShaders()
    // creates & compiles shaders

    bool InitShaders() {

        HRESULT hr = 0;

        // *** PS1
        ID3DXBuffer* PSBuffer    = 0;
        ID3DXBuffer* errorBuffer = 0;

        char pathPS1[50] = "";
        char pathVS1[50] = "";

        const char* psbigtex   = "source/ps_bigtex.txt";
        const char* vsbigtex   = "source/vs_bigtex.txt";
        const char* psonebyone = "source/ps_onebyone.txt";
        const char* vsonebyone = "source/vs_onebyone.txt";

        if (SceneSize == 256) {
            strcpy(pathPS1, psonebyone);
            strcpy(pathVS1, vsonebyone);
        }
        else {
            strcpy(pathPS1, psbigtex);
            strcpy(pathVS1, vsbigtex);
        }

        hr = D3DXCompileShaderFromFile(
            pathPS1,
            0,
            0,
            "PS_Main",  // entry point function name
            "ps_2_0",
            D3DXSHADER_SKIPVALIDATION, // DEBUG, // | D3DXSHADER_SKIPOPTIMIZATION
            &PSBuffer,
            &errorBuffer,
            &PS1_PSCT);

        // output any error messages
        if (errorBuffer)
        {
            ::MessageBox(0, (char*)errorBuffer->GetBufferPointer(), 0, 0);
            Release<ID3DXBuffer*>(errorBuffer);
        }

        if (FAILED(hr))
        {
            ::MessageBox(0, "PS1--D3DXCompileShaderFromFile() - FAILED", 0, 0);
            return false;
        }

        // create pixel shader
        hr = Device->CreatePixelShader(
            (DWORD*)PSBuffer->GetBufferPointer(),
            &PS1_maddreduce);

        if (FAILED(hr))
        {
            ::MessageBox(0, "CreatePixelShader PS1 - FAILED", 0, 0);
            return false;
        }

        Release<ID3DXBuffer*>(PSBuffer);

        // *** PS2
        ID3DXBuffer* PS2Buffer    = 0;
        ID3DXBuffer* errorBuffer2 = 0;

        hr = D3DXCompileShaderFromFile(
            "source/ps_16tapredux_2.txt",
            0,
            0,
            "PS_Main",  // entry point function name
            "ps_2_0",
            D3DXSHADER_SKIPVALIDATION, // DEBUG, // | D3DXSHADER_SKIPVALIDATION OPTIMIZATION
            &PS2Buffer,
            &errorBuffer2,
            &PS2_PSCT);

        // output any error messages
        if (errorBuffer2)
        {
            ::MessageBox(0, (char*)errorBuffer2->GetBufferPointer(), 0, 0);
            Release<ID3DXBuffer*>(errorBuffer2);
        }

        if (FAILED(hr))
        {
            ::MessageBox(0, "PS2--D3DXCompileShaderFromFile() - FAILED", 0, 0);
            return false;
        }

        // create pixel shader
        hr = Device->CreatePixelShader(
            (DWORD*)PS2Buffer->GetBufferPointer(),
            &PS2_16tapreduce);

        if (FAILED(hr))
        {
            ::MessageBox(0, "CreatePixelShader PS2 - FAILED", 0, 0);
            return false;
        }

        Release<ID3DXBuffer*>(PS2Buffer);

        // *** VS1
        ID3DXBuffer* VSBuffer     = 0;
        ID3DXBuffer* errorBuffer3 = 0;

        hr = D3DXCompileShaderFromFile(
            pathVS1,
            0,
            0,
            "Main",  // entry point function name
            "vs_2_0",
            D3DXSHADER_SKIPVALIDATION, // DEBUG, // | D3DXSHADER_SKIPOPTIMIZATION
            &VSBuffer,
            &errorBuffer3,
            &VS1_VSCT);

        // output any error messages
        if (errorBuffer3)
        {
            ::MessageBox(0, (char*)errorBuffer3->GetBufferPointer(), 0, 0);
            Release<ID3DXBuffer*>(errorBuffer3);
        }

        if (FAILED(hr))
        {
            ::MessageBox(0, "VS1--D3DXCompileShaderFromFile() - FAILED", 0, 0);
            return false;
        }

        // create vertex shader
        hr = Device->CreateVertexShader(
            (DWORD*)VSBuffer->GetBufferPointer(),
            &VS1_maddreduce);

        if (FAILED(hr))
        {
            ::MessageBox(0, "CreateVertexShader VS1 - FAILED", 0, 0);
            return false;
        }

        Release<ID3DXBuffer*>(VSBuffer);

        // *** VS2
        ID3DXBuffer* VS2Buffer    = 0;
        ID3DXBuffer* errorBuffer4 = 0;

        hr = D3DXCompileShaderFromFile(
            "source/vs_16tapredux_2.txt",
            0,
            0,
            "Main",  // entry point function name
            "vs_2_0",
            D3DXSHADER_SKIPVALIDATION, // DEBUG, // | D3DXSHADER_SKIPOPTIMIZATION
            &VS2Buffer,
            &errorBuffer4,
            &VS2_VSCT);

        // output any error messages
        if (errorBuffer4)
        {
            ::MessageBox(0, (char*)errorBuffer4->GetBufferPointer(), 0, 0);
            Release<ID3DXBuffer*>(errorBuffer4);
        }

        if (FAILED(hr))
        {
            ::MessageBox(0, "VS2--D3DXCompileShaderFromFile() - FAILED", 0, 0);
            return false;
        }

        // create vertex shader
        hr = Device->CreateVertexShader(
            (DWORD*)VS2Buffer->GetBufferPointer(),
            &VS2_16tapreduce);

        if (FAILED(hr))
        {
            ::MessageBox(0, "CreateVertexShader VS2 - FAILED", 0, 0);
            return false;
        }

        Release<ID3DXBuffer*>(VS2Buffer);

        // get VS1 pixelsize constant handle
        VS1_PixelSizeHandle = VS1_VSCT->GetConstantByName(0, "PixelSize");
        if (SceneSize != 256) {
            PS1_PixelSizeHandle = PS1_PSCT->GetConstantByName(0, "PixelSize");
        }

        // get PS2 and VS2 const handles
        VS2_offsetHandle = VS2_VSCT->GetConstantByName(0, "offset");
        PS2_offsetHandle = PS2_PSCT->GetConstantByName(0, "offset");

        return true;

    } // InitShaders()


    // InitReticlesAndScene()
    // loads reticle images into GPU, creates reticle and scene surfaces
    // and textures in GPU memory

    bool InitReticlesAndScene() {

        // create scene texture and surface
        HRESULT hr = 0;
        hr = D3DXCreateTexture(
            Device,
            SceneSize, SceneSize,
            1,                 // no mipmap chain
            D3DUSAGE_DYNAMIC,  // was 0 -- keep DYNAMIC!
            D3D_FORMAT,
            D3DPOOL_DEFAULT,
            &Scene_Tex);
        if (FAILED(hr))
            return false;

        // get interface to top level surface of Scene_Tex
        hr = Scene_Tex->GetSurfaceLevel(0, &Scene_Surface);
        if (FAILED(hr))
            return false;

        // create 100 reticle textures or 4 8x8 palettes, depending
        // on whether image size is 512 or 256/128
        if (SceneSize == 256) {

            // create 100 individual reticle textures
            for (int i = 0; i < 100; i++) {
                hr = D3DXCreateTexture(
                    Device,
                    SceneSize, SceneSize,
                    1,  // no mipmap chain
                    0,  // D3DUSAGE_DYNAMIC, //usage
                    D3D_FORMAT,
                    D3DPOOL_DEFAULT,
                    &Reticle_Tex[i]);
                if (FAILED(hr))
                    return false;

                // get interface to top level surface of each tex
                hr = Reticle_Tex[i]->GetSurfaceLevel(0, &Reticle_Surface[i]);
                if (FAILED(hr))
                    return false;
            }
        }
        else {

            // create 4 reticle palette textures
            for (int i = 0; i < 4; i++) {
                hr = D3DXCreateTexture(
                    Device,
                    SceneSize * 8, SceneSize * 8,
                    1,  // no mipmap chain
                    0,  // D3DUSAGE_DYNAMIC, //usage; DYNAMIC loads faster
                    D3D_FORMAT,
                    D3DPOOL_DEFAULT,
                    &Reticle_Tex[i]);
                if (FAILED(hr))
                    return false;

                // get interface to top level surface of each tex
                hr = Reticle_Tex[i]->GetSurfaceLevel(0, &Reticle_Surface[i]);
                if (FAILED(hr))
                    return false;

                // create 4 dynamic textures in SYSTEMMEM to build palettes
                hr = D3DXCreateTexture(
                    Device,
                    SceneSize * 8, SceneSize * 8,
                    1,                 // no mipmap chain
                    D3DUSAGE_DYNAMIC,  // can't be DYNAMIC and RT
                    D3D_FORMAT,
                    D3DPOOL_SYSTEMMEM,
                    &Pal_Tex[i]);
                if (FAILED(hr))
                    return false;

                // get interface to top level surface
                hr = Pal_Tex[i]->GetSurfaceLevel(0, &Pal_Surf[i]);
                if (FAILED(hr))
                    return false;
            }
        }

        return true;

    } // InitReticlesAndScene()

    // ----------------------------------------
    // InitRenderTargets_hybrid()
    // ----------------------------------------
    bool InitRenderTargets_hybrid() {

        HRESULT hr = 0;

        // Set Init_RTSize -- the size of RT resulting from first mul-reduce op
        // Set ReduceIterations -- controls how many times the 16:1 reduce will be
        // run after the 1st mul-reduce has been done
        int Init_RTSize;

        if (SceneSize == 64) {
            Init_RTSize = 256;
            ReduceIterations = 2;
        }
        else if (SceneSize == 128) {
            Init_RTSize = 512;
            ReduceIterations = 3;
        }
        else {
            Init_RTSize = 1024;
            ReduceIterations = 3;
            start1 = 0;
            end1 = 40;
        }

        // Set OutTexSize -- the size of the final RT we will get our
        // result(s) from; affects GetRTData()
        OutTexSize = 8*SceneSize/(2*((int)pow(4, ReduceIterations)));
        OutPixels = OutTexSize*OutTexSize;

        // SET TexIndex -- the array index of the RT_Reduce_Surface[] that will contain
        // the final result; affects GetRTData()
        TexIndex = ReduceIterations-1;

        // create initial RT (half the reticle palette size because of 4:1 reduction)
        hr = Device->CreateTexture(Init_RTSize, Init_RTSize, 1,
            D3DUSAGE_RENDERTARGET, D3DFMT_R32F,
            D3DPOOL_DEFAULT, &RT_Tex, 0); // D3D_FORMAT
        if (FAILED(hr))
            return false;

        // get interface to top level surface
        hr = RT_Tex->GetSurfaceLevel(0, &RT_Surface);
        if (FAILED(hr))
            return false;

        // create render targets for reduction op (2 or 3)
        int Size = Init_RTSize/4;
        for (int i = 0; i < ReduceIterations; i++) {
            hr = D3DXCreateTexture(
                Device,
                Size, Size,
                1,                      // no mipmap chain
                D3DUSAGE_RENDERTARGET,  // can't be DYNAMIC and RT
                D3DFMT_R32F,            // D3D_FORMAT
                D3DPOOL_DEFAULT,
                &RT_Reduce_Tex[i]);
            if (FAILED(hr))
                return false;

            // get interface to top level surface
            hr = RT_Reduce_Tex[i]->GetSurfaceLevel(0, &RT_Reduce_Surface[i]);
            if (FAILED(hr))
                return false;

            Size /= 4;
        }

        // create SYSTEMMEM tex to send result(s) to
        hr = D3DXCreateTexture(
            Device,
            OutTexSize, OutTexSize,
            1,                 // no mipmap chain
            D3DUSAGE_DYNAMIC,  // try dynamic and zero
            D3DFMT_R32F,       // D3D_FORMAT,
            D3DPOOL_SYSTEMMEM,
            &Ones_Tex);
        if (FAILED(hr))
            return false;

        // get interface to top level surface of Ones_Tex[]
        hr = Ones_Tex->GetSurfaceLevel(0, &Ones_Surface);
        if (FAILED(hr))
            return false;

        // *** OFFSET ARRAYS FOR 16:1 REDUCE -- sent to vs and ps to calculate texcoords for adjacent
        // pixels in the 4x4 block

        // PixelSize of input texture to first reduce op
        float PixelSize2 = 1/(float)Init_RTSize;

        // calculate displacements
        for (int k = 0; k < ReduceIterations; k++) {
            for (int i = 0; i < 4; i++) {
                for (int j = 0; j < 4; j++) {
                    offset[k][i*4+j] = D3DXVECTOR2(PixelSize2*(float)j, PixelSize2*(float)i);
                }
            }
            PixelSize2 *= 4.0f;
        }

        return true;

    } // InitRenderTargets_hybrid()
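
The size bookkeeping in InitRenderTargets_hybrid() can be checked without a Direct3D device. The following is a minimal sketch under stated assumptions: Offset2, Pow4, FinalTexSize and BlockOffsets are hypothetical names that do not appear in the listing, Offset2 stands in for D3DXVECTOR2, and an integer power replaces the pow() call. The formula and the 4x4 offset layout are taken from the function above.

```cpp
#include <cassert>
#include <vector>

struct Offset2 { float u, v; };  // hypothetical stand-in for D3DXVECTOR2

// Integer power of four, avoiding floating-point pow().
int Pow4(int n) {
    int r = 1;
    for (int i = 0; i < n; ++i) r *= 4;
    return r;
}

// Edge length of the final render target after the 4:1 multiply-reduce and
// 'iterations' passes of the 16:1 reduce, per the formula in the listing.
int FinalTexSize(int sceneSize, int iterations) {
    return 8 * sceneSize / (2 * Pow4(iterations));
}

// Texture-coordinate offsets of the 16 texels in a 4x4 block for one pass,
// where 'texSize' is the edge length of that pass's input texture.
std::vector<Offset2> BlockOffsets(int texSize) {
    const float px = 1.0f / static_cast<float>(texSize);
    std::vector<Offset2> out(16);
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out[i * 4 + j] = Offset2{px * j, px * i};
        }
    }
    return out;
}
```

For the three configurations in the listing (SceneSize 64 with 2 passes, 128 with 3, and 256 with 3) FinalTexSize gives 16, 8 and 16, which matches dividing the initial render target (256, 512, 1024) by 4 once per pass.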

    // ----------------------------------------
    // Setup()
    // Initializes geometry, renderstate, calls
    // InitRenderTargets, InitShaders, InitReticlesAndScene
    //
    bool Setup() {

        HRESULT hr = 0;

        // -------- DISABLE unneeded processing --------
        // turn off Stencil and Culling
        hr = Device->SetDepthStencilSurface(0);
        if (FAILED(hr))
            return false;

        hr = Device->SetRenderState(D3DRS_CULLMODE, D3DCULL_NONE);
        if (FAILED(hr))
            return false;

        // disable lighting
        Device->SetRenderState(D3DRS_LIGHTING, false);

        // -------- create geometry --------
        D3DVERTEXELEMENT9 decl[] =
        {
            {0, 0, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0},
            D3DDECL_END()
        };

        // declare the vertex structure
        hr = Device->CreateVertexDeclaration(decl, &m_pDecl);
        if (FAILED(hr))
            return false;

        // create VB with only x,y position
        hr = Device->CreateVertexBuffer(56 * sizeof(CUSTOMVERTEX), // was 4
            D3DUSAGE_WRITEONLY,
            0,
            D3DPOOL_DEFAULT,
            &QuadVB,
            NULL);

        if (FAILED(hr))
            return false;


        float left     = -1.00f;
        float right    =  1.00f;
        float top      =  1.00f;
        float bottom   = -1.00f;
        float top_row2 =  0.75f;
        float top_row3 =  0.50f;
        float top_row4 =  0.25f;
        float top_row5 =  0.00f;
        float top_row7 = -0.50f;
        float top_row8 = -0.75f;
        float bot_row1 =  0.75f;
        float bot_row2 =  0.50f;
        float bot_row4 =  0.00f;
        float bot_row5 = -0.25f;
        float bot_row6 = -0.50f;
        float bot_row7 = -0.75f;

        CUSTOMVERTEX* v;
        QuadVB->Lock(0, 56 * sizeof(CUSTOMVERTEX), (VOID**)&v, 0); // was 4

        // quad 0: full square
        v[0].x  = left;   v[0].y  = bottom;    // left bottom
        v[1].x  = left;   v[1].y  = top;       // left top
        v[2].x  = right;  v[2].y  = bottom;    // right bottom
        v[3].x  = right;  v[3].y  = top;       // right top

        // quad 1: r1-5
        v[4].x  = left;   v[4].y  = bot_row5;  // left bottom
        v[5].x  = left;   v[5].y  = top;       // left top
        v[6].x  = right;  v[6].y  = bot_row5;  // right bottom
        v[7].x  = right;  v[7].y  = top;       // right top

        // quad 2: r2-6
        v[8].x  = left;   v[8].y  = bot_row6;  // left bottom
        v[9].x  = left;   v[9].y  = top_row2;  // left top
        v[10].x = right;  v[10].y = bot_row6;  // right bottom
        v[11].x = right;  v[11].y = top_row2;  // right top

        // quad 3: r3-7
        v[12].x = left;   v[12].y = bot_row7;  // left bottom
        v[13].x = left;   v[13].y = top_row3;  // left top
        v[14].x = right;  v[14].y = bot_row7;  // right bottom
        v[15].x = right;  v[15].y = top_row3;  // right top

        // quad 4: r4-8
        v[16].x = left;   v[16].y = bottom;    // left bottom
        v[17].x = left;   v[17].y = top_row4;  // left top
        v[18].x = right;  v[18].y = bottom;    // right bottom
        v[19].x = right;  v[19].y = top_row4;  // right top

        // quad 5: r1-6
        v[20].x = left;   v[20].y = bot_row6;  // left bottom
        v[21].x = left;   v[21].y = top;       // left top
        v[22].x = right;  v[22].y = bot_row6;  // right bottom
        v[23].x = right;  v[23].y = top;       // right top

        // quad 6: r2-7
        v[24].x = left;   v[24].y = bot_row7;  // left bottom
        v[25].x = left;   v[25].y = top_row2;  // left top
        v[26].x = right;  v[26].y = bot_row7;  // right bottom
        v[27].x = right;  v[27].y = top_row2;  // right top

        // quad 7: r3-8
        v[28].x = left;   v[28].y = bottom;    // left bottom
        v[29].x = left;   v[29].y = top_row3;  // left top
        v[30].x = right;  v[30].y = bottom;    // right bottom
        v[31].x = right;  v[31].y = top_row3;  // right top

        // quad 8: T1
        v[32].x = left;   v[32].y = bot_row1;  // left bottom
        v[33].x = left;   v[33].y = top;       // left top
        v[34].x = right;  v[34].y = bot_row1;  // right bottom
        v[35].x = right;  v[35].y = top;       // right top

        // quad 9: B2
        v[36].x = left;   v[36].y = bottom;    // left bottom
        v[37].x = left;   v[37].y = top_row7;  // left top
        v[38].x = right;  v[38].y = bottom;    // right bottom
        v[39].x = right;  v[39].y = top_row7;  // right top

        // quad 10: T4
        v[40].x = left;   v[40].y = bot_row4;  // left bottom
        v[41].x = left;   v[41].y = top;       // left top
        v[42].x = right;  v[42].y = bot_row4;  // right bottom
        v[43].x = right;  v[43].y = top;       // right top

        // quad 11: B1
        v[44].x = left;   v[44].y = bottom;    // left bottom
        v[45].x = left;   v[45].y = top_row8;  // left top
        v[46].x = right;  v[46].y = bottom;    // right bottom
        v[47].x = right;  v[47].y = top_row8;  // right top

        // quad 12: B4
        v[48].x = left;   v[48].y = bottom;    // left bottom
        v[49].x = left;   v[49].y = top_row5;  // left top
        v[50].x = right;  v[50].y = bottom;    // right bottom
        v[51].x = right;  v[51].y = top_row5;  // right top

        // quad 13: T2
        v[52].x = left;   v[52].y = bot_row2;  // left bottom
        v[53].x = left;   v[53].y = top;       // left top
        v[54].x = right;  v[54].y = bot_row2;  // right bottom
        v[55].x = right;  v[55].y = top;       // right top

        QuadVB->Unlock();

        // set vertex declaration (will not change again)
        Device->SetVertexDeclaration(m_pDecl);

        if (SceneSize == 256) {
            // set geometry (will not change again)
            Device->SetStreamSource(0, QuadVB, 0, sizeof(CUSTOMVERTEX));
        }

        return true;

    } // Setup()


private:

    // ----------------------------------------
    // -------- LOAD INPUT SCENE --------
    // ----------------------------------------
    bool LoadInputScene(float p_inputArray[]) {

        RECT SurfRect;
        SurfRect.left   = 0;
        SurfRect.top    = 0;
        SurfRect.right  = SceneSize;
        SurfRect.bottom = SceneSize;

        HRESULT hr = 0;
        hr = D3DXLoadSurfaceFromMemory(
            Scene_Surface,
            0,
            0,
            p_inputArray,
            D3D_FORMAT,
            (16*SceneSize),  // 16 for 4x32-bit, 8 for 2x32-bit, 4 for 1x32F format
            0,
            &SurfRect,
            D3DX_FILTER_NONE,
            0);

        if (FAILED(hr))
            return false;
        return true;

    } // LoadInputScene()

    // ----------------------------------------
    // Redux()
    // 16:1 REDUCE OPERATION
    // Uses files: vs_16tapredux_2.txt and ps_16tapredux_2.txt
    // ----------------------------------------
    bool Redux() {

        HRESULT hr = 0;

        // ---- set PS and VS shaders
        Device->SetVertexShader(VS2_16tapreduce);
        Device->SetPixelShader(PS2_16tapreduce);

        // initial source tex is result of maddreduce op
        Device->SetTexture(0, RT_Tex);

        for (int i = 0; i < ReduceIterations; i++) {

            hr = Device->SetRenderTarget(0, RT_Reduce_Surface[i]);
            if (FAILED(hr))
                return false;

            if (i > 0)
                Device->SetTexture(0, RT_Reduce_Tex[i-1]);

            // set VS offset constant array
            hr = VS2_VSCT->SetFloatArray(Device,
                VS2_offsetHandle,
                (float*)&offset[i][0],
                16); // 2*8 floats
            if (FAILED(hr))
                return false;

            // set PS offset constant array
            hr = PS2_PSCT->SetFloatArray(Device,
                PS2_offsetHandle,
                (float*)&offset[i][8],
                16); // 2*8 floats
            if (FAILED(hr))
                return false;

            // render -- do 16:1 reduction on source image
            // Device->Clear(0, 0, D3DCLEAR_TARGET, 0L, 0, 0);
            Device->BeginScene();
            Device->DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2);
            Device->EndScene();
        }

        return true;

    } // Redux()
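
For checking GPU results, a CPU reference of one 16:1 pass is useful. The sketch below is an assumption-laden illustration, not the shader used by Redux(): the shader files are not reproduced in this listing, so it is assumed here that each output texel is the plain sum of its 4x4 input block. Reduce16to1 is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// CPU reference for one 16:1 reduction pass over a square, row-major image.
// ASSUMPTION: each output texel is the sum of the corresponding 4x4 block;
// the actual shader may weight or combine the 16 taps differently.
// 'size' must be a multiple of 4.
std::vector<float> Reduce16to1(const std::vector<float>& src, int size) {
    const int outSize = size / 4;
    std::vector<float> dst(outSize * outSize, 0.0f);
    for (int oy = 0; oy < outSize; ++oy) {
        for (int ox = 0; ox < outSize; ++ox) {
            float sum = 0.0f;
            for (int dy = 0; dy < 4; ++dy) {
                for (int dx = 0; dx < 4; ++dx) {
                    sum += src[(oy * 4 + dy) * size + (ox * 4 + dx)];
                }
            }
            dst[oy * outSize + ox] = sum;
        }
    }
    return dst;
}
```

Applying the function repeatedly mirrors the loop over ReduceIterations: a 16x16 image of ones becomes a 4x4 image of 16s after one pass and a single value of 256 after two.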

    // ----------------------------------------
    // bool MAddReduce()
    // for ONEBYONE (Palette) approach
    // uses files: vs_onebyone.txt and ps_onebyone.txt
    // ----------------------------------------
    bool MAddReduce(int p_retIndexStart) {

        HRESULT hr = 0;

        // ---- set PS and VS shaders
        Device->SetVertexShader(VS1_maddreduce);
        Device->SetPixelShader(PS1_maddreduce);

        // set VS PixelSize const
        // PixelSizeX & Y initialized as Global constants
        hr = VS1_VSCT->SetVector(Device, VS1_PixelSizeHandle,
            &D3DXVECTOR4(fPixSizeX, fPixSizeY, 1.0f, 1.0f));
        if (FAILED(hr))
            return false;

        // ---- set RT
        hr = Device->SetRenderTarget(0, RT_Surface);
        if (FAILED(hr))
            return false;

        Device->SetStreamSource(0, QuadVB, 0, sizeof(CUSTOMVERTEX));

        D3DVIEWPORT9 vp;
        vp.Width  = SceneSize/2;
        vp.Height = SceneSize/2;
        vp.MinZ   = 0.0f;
        vp.MaxZ   = 1.0f;

        Device->SetTexture(0, Scene_Tex); // stage 0 = input scene

        for (int v = 0; v < 5; v++) {
            for (int h = 0; h < 8; h++) {
                vp.X = h*SceneSize/2;
                vp.Y = v*SceneSize/2;
                Device->SetViewport(&vp);

                // set sampler 1 with reticle image for maddredux with scene
                Device->SetTexture(1, Reticle_Tex[(p_retIndexStart + (v*8+h)) % 100]);

                // render -- madd scene with a single reticle image
                Device->BeginScene();
                Device->DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2);
                Device->EndScene();
            }
        }

        // set streamsource to 8x5 rectangle r1-r5, Quad 1
        Device->SetStreamSource(0, QuadVB, 4*sizeof(CUSTOMVERTEX), sizeof(CUSTOMVERTEX));
        return true;

    } // MAddReduce()


    // MAddReduce_hybrid()
    // For 128 and 256 input scene sizes.
    // Multiplies input scene with an 8x8 reticle palette using WRAPing when
    // sampling the scene and does 4:1 reduction. Scene width is 1/8 the width of the palette.
    // Adjusts size of rendering rectangle to cut out unneeded calculations.
    // This rendering rectangle remains set for the reduction op, too.
    //
    // Uses files: vs_bigtex.txt and ps_bigtex.txt

    bool MAddReduce_hybrid(int p_retIndex) {

        HRESULT hr = 0;
        int row = 0;
        int col = 0;
        int diff = p_retIndex; // =0

        // determines which of the 4 palettes to use,
        // ensuring 40 contiguous reticles present in the palette
        // set palette as texture stage 0
        if (p_retIndex <= 24) {
            Device->SetTexture(0, Reticle_Tex[0]);
            row = p_retIndex/8;
            col = p_retIndex%8;
        }
        else if (p_retIndex <= 49) {
            Device->SetTexture(0, Reticle_Tex[1]);
            diff = p_retIndex-25;
            row = diff/8;
            col = diff%8;
        }
        else if (p_retIndex <= 74) {
            Device->SetTexture(0, Reticle_Tex[2]);
            diff = p_retIndex-50;
            row = diff/8;
            col = diff%8;
        }
        else {
            Device->SetTexture(0, Reticle_Tex[3]);
            diff = p_retIndex-75;
            row = diff/8;
            col = diff%8;
        }

        // new code
        start1 = diff;
        end1 = diff+40;
        //

        if (col == 0)
            Device->SetStreamSource(0, QuadVB, (row+1)*4*sizeof(CUSTOMVERTEX), sizeof(CUSTOMVERTEX));
        else
            Device->SetStreamSource(0, QuadVB, (row+1+4)*4*sizeof(CUSTOMVERTEX), sizeof(CUSTOMVERTEX));

        // set VS and PS
        Device->SetVertexShader(VS1_maddreduce);
        Device->SetPixelShader(PS1_maddreduce);

        hr = VS1_VSCT->SetVector(Device, VS1_PixelSizeHandle,
            &D3DXVECTOR4(1.0f/(float)(SceneSize*8),
                         1.0f/(float)SceneSize, 0.0f, 1.0f));

        // ---- set tex stage 1 to be input scene
        Device->SetTexture(1, Scene_Tex);

        hr = Device->SetRenderTarget(0, RT_Surface);
        if (FAILED(hr))
            return false;
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II  render:  multiply  input  scene  with  reticle  image  and  do  4:1  reduction 

II  Devi  ce  -  >CI  e a r  (  0,  0,  D3DCLEAR  TARGET,  0L,  0,  0) ; 

Devi  ce- >Begi  nScenef ) ; 

Devi  ce- >Dr  a  wP r  i  mi  t  i  ve(  D3DPT  TRI  ANGLESTRI  P,  0,  2 ) ;  /  /  wa s  0 
Devi  ce- >EndScene( ) ; 

return  true; 

}//  MAddReduce_hybri  d( ) 
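
The palette choice at the top of MAddReduce_hybrid() can be read as plain integer arithmetic: palette k begins at reticle 25*k, so any starting index leaves at least 40 reticles before the end of its 64-slot palette. The following is a minimal sketch of that reading only. The names PaletteSlot and selectPalette are hypothetical and do not appear in the listing, and the sketch assumes reticle indices run from 0 to 99.

```cpp
#include <cassert>

// Hypothetical helper (not part of the original class): mirrors the
// if/else chain that picks one of the four reticle palettes.
struct PaletteSlot {
    int palette; // which palette texture to bind (0..3)
    int diff;    // starting reticle index relative to the palette's first reticle
    int row;     // row of the starting reticle in the 8x8 palette
    int col;     // column of the starting reticle in the 8x8 palette
};

PaletteSlot selectPalette(int retIndex) {
    assert(retIndex >= 0 && retIndex < 100); // assumed index range
    int palette = retIndex / 25;             // 0..24 -> 0, 25..49 -> 1, 50..74 -> 2, 75..99 -> 3
    int diff = retIndex - 25 * palette;      // always 0..24, so diff + 40 <= 64
    return PaletteSlot{palette, diff, diff / 8, diff % 8};
}
```

With this helper, a starting index of 30 selects palette 1 with diff 5, which places the first reticle at row 0, column 5.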


// GetRTData_hybrid()
// retrieve FINAL data from lockable render-to surface

void GetRTData_hybrid(double p_b[]) {
    HRESULT hr = 0;
    int q;
    int col;
    int row;
    D3DLOCKED_RECT lockedRect;

    Device->GetRenderTargetData(RT_Reduce_Surface[TexIndex], Ones_Surface);
    Ones_Surface->LockRect(&lockedRect,
                           0,                 // lock entire tex
                           D3DLOCK_READONLY); // flags

    // D3DXVECTOR4* imageData = (D3DXVECTOR4*)lockedRect.pBits;
    float* imageData = (float*)lockedRect.pBits;

    // perform final 4:1 reduction if necessary and add up 4 components of each pixel
    if (OutTexSize > 8) {
        for (int i = start1; i < end1; i++) {
            row = i/8;
            col = i%8;
            q = row*32 + col*2;
            p_b[i-start1] = imageData[q] +
                            imageData[q+1] +
                            imageData[q+16] +
                            imageData[q+17];
        }
    }
    else {
        for (int i = start1; i < end1; i++) {
            p_b[i-start1] = imageData[i]; // .x + imageData[i].y + imageData[i].z + imageData[i].w;
        }
    }

    Ones_Surface->UnlockRect();
    return;
} // GetRTData_hybrid()
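
The index arithmetic in the OutTexSize > 8 branch is easier to follow on the CPU. The sketch below assumes, from the offsets q+16 and q+17, that the read-back surface is 16 floats wide and stored row-major with no padding; the real surface pitch comes from the locked rectangle and may differ. The name finalReduce is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for the final 4:1 reduction: result i is the sum
// of the 2x2 block whose top-left float sits at q = (i/8)*32 + (i%8)*2 in a
// 16-float-wide image (assumed layout).
std::vector<double> finalReduce(const std::vector<float>& image, int start, int end) {
    assert(0 <= start && start <= end && end <= 64);
    assert(image.size() >= static_cast<std::size_t>(16 * 16));
    std::vector<double> out;
    out.reserve(static_cast<std::size_t>(end - start));
    for (int i = start; i < end; ++i) {
        int row = i / 8;            // row of result i in the 8x8 result grid
        int col = i % 8;            // column of result i
        int q = row * 32 + col * 2; // two image rows per result row, two floats per result
        float sum = image[q] + image[q + 1] + image[q + 16] + image[q + 17];
        out.push_back(sum);         // widened to double, as in the listing
    }
    return out;
}
```

For an image filled with its own indices (image[k] = k), result 0 is 0 + 1 + 16 + 17 = 34 and result 8, the first result of the second row, is 32 + 33 + 48 + 49 = 162.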


// Release() and Delete()
// cleanup functions

template<class T> void Release(T t) {
    if (t) {
        t->Release();
        t = 0;
    }
}

template<class T> void Delete(T t) {
    if (t) {
        delete t;
        t = 0;
    }
}
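
Because Release() and Delete() receive the pointer by value, the assignment t = 0 appears to clear only the local copy, so the caller's pointer still refers to the released object. A possible variant, sketched here under that reading, takes the pointer by reference. FakeComObject and SafeRelease are hypothetical names used only so the example compiles without the Direct3D headers.

```cpp
#include <cassert>

// Hypothetical stand-in for a COM interface with a reference count.
struct FakeComObject {
    int refCount = 1;
    unsigned long Release() { return static_cast<unsigned long>(--refCount); }
};

// Releases the object and nulls the caller's pointer, so a second call
// on the same pointer is a harmless no-op.
template <class T>
void SafeRelease(T*& p) {
    if (p) {
        p->Release();
        p = nullptr;
    }
}
```

Calling SafeRelease twice on the same pointer decrements the count once, which would make a repeated Cleanup() call safe.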


// Cleanup()
// releases textures/surfaces/interfaces/devices/memory
// allocated during program

void Cleanup() {
    // vertex buffer and declaration
    Release<IDirect3DVertexBuffer9*>(QuadVB);
    Release<LPDIRECT3DVERTEXDECLARATION9>(m_pDecl);

    // textures and surfaces
    Release<IDirect3DSurface9*>(Scene_Surface);
    Release<IDirect3DTexture9*>(Scene_Tex);

    int n = 4;
    if (SceneSize == 256) n = 100;
    for (int t = 0; t < n; t++) {
        Release<IDirect3DSurface9*>(Reticle_Surface[t]);
        Release<IDirect3DTexture9*>(Reticle_Tex[t]);
    }

    // initial RT
    Release<IDirect3DTexture9*>(RT_Tex);
    Release<IDirect3DSurface9*>(RT_Surface);

    for (int t = 0; t < ReduceIterations; t++) {
        Release<IDirect3DSurface9*>(RT_Reduce_Surface[t]);
        Release<IDirect3DTexture9*>(RT_Reduce_Tex[t]);
    }

    Release<IDirect3DTexture9*>(Ones_Tex);
    Release<IDirect3DSurface9*>(Ones_Surface);

    // PS & VS
    Release<IDirect3DPixelShader9*>(PS1_maddreduce);
    Release<ID3DXConstantTable*>(PS1_PSCT);
    Release<IDirect3DVertexShader9*>(VS1_maddreduce);
    Release<ID3DXConstantTable*>(VS1_VSCT);
    Release<IDirect3DPixelShader9*>(PS2_16tapreduce);
    Release<ID3DXConstantTable*>(PS2_PSCT);
    Release<IDirect3DVertexShader9*>(VS2_16tapreduce);
    Release<ID3DXConstantTable*>(VS2_VSCT);

    Device->Release();
} // Cleanup()

public:

//
// Process()
// user interface to GPU algorithm
// input: reference to scene image array variable
// input: starting index in reticle palette
// output: void (but 40 dot-product results are loaded to user array)
void Process(int p_retIndex, float p_SceneArray[], double p_b[]) {
    LoadInputScene(p_SceneArray);
    if (SceneSize == 256) {
        // do one at a time algorithm
        MAddReduce(p_retIndex);
    }
    else {
        // do big tex
        MAddReduce_hybrid(p_retIndex);
    }
    Redux();
    GetRTData_hybrid(p_b);
    return;
} // Process()

//
// uploadReticle
//
bool uploadReticle(int p_index, float p_array[]) {
    HRESULT hr = 0;
    RECT rect;
    RECT srcRect;
    srcRect.top = 0;
    srcRect.bottom = SceneSize;
    srcRect.left = 0;
    srcRect.right = SceneSize;

    if (SceneSize == 256) {
        rect.left = 0;
        rect.right = SceneSize;
        rect.top = 0;
        rect.bottom = SceneSize;

        hr = D3DXLoadSurfaceFromMemory(
            Reticle_Surface[p_index],
            0,
            0,
            p_array,
            D3D_FORMAT,
            (16*SceneSize), // 16 for 4x32-bit, 8 for 2x32-bit, 4 for 1x32F format
            0,
            &srcRect,
            D3DX_FILTER_NONE,
            0);
        if (FAILED(hr))
            return false;
    }
    else {
        if ((p_index <= 63) && (p_index >= 0)) {
            rect.top = SceneSize*(p_index/8);
            rect.left = SceneSize*(p_index%8);
            rect.bottom = rect.top+SceneSize;
            rect.right = rect.left+SceneSize;

            hr = D3DXLoadSurfaceFromMemory(
                PalSurf[0], // Reticle_Surface[0],
                0,
                &rect, // dest rect
                p_array,
                D3D_FORMAT,
                (16*SceneSize), // 16 for 4x32-bit, 8 for 2x32-bit, 4 for 1x32F format
                0,
                &srcRect,
                D3DX_FILTER_NONE,
                0);
            if (FAILED(hr))
                return false;
        }

        if ((p_index >= 25) && (p_index <= 88)) {
            rect.top = SceneSize*((p_index-25)/8);
            rect.left = SceneSize*((p_index-25)%8);
            rect.bottom = rect.top+SceneSize;
            rect.right = rect.left+SceneSize;

            hr = D3DXLoadSurfaceFromMemory(
                PalSurf[1], // Reticle_Surface[1],
                0,
                &rect, // dest rect
                p_array,
                D3D_FORMAT,
                (16*SceneSize), // 16 for 4x32-bit, 8 for 2x32-bit, 4 for 1x32F format
                0,
                &srcRect,
                D3DX_FILTER_NONE,
                0);
            if (FAILED(hr))
                return false;
        }

        if ((((p_index+50)%100) >= 0) && (((p_index+50)%100) <= 63)) {
            rect.top = SceneSize*(((p_index+50)%100)/8);
            rect.left = SceneSize*(((p_index+50)%100)%8);
            rect.bottom = rect.top+SceneSize;
            rect.right = rect.left+SceneSize;

            hr = D3DXLoadSurfaceFromMemory(
                PalSurf[2], // Reticle_Surface[2],
                0,
                &rect,
                p_array,
                D3D_FORMAT,
                (16*SceneSize), // 16 for 4x32-bit, 8 for 2x32-bit, 4 for 1x32F format
                0,
                &srcRect,
                D3DX_FILTER_NONE,
                0);
            if (FAILED(hr))
                return false;
        }

        if ((((p_index+25)%100) >= 0) && (((p_index+25)%100) <= 63)) {
            rect.top = SceneSize*(((p_index+25)%100)/8);
            rect.left = SceneSize*(((p_index+25)%100)%8);
            rect.bottom = rect.top+SceneSize;
            rect.right = rect.left+SceneSize;

            hr = D3DXLoadSurfaceFromMemory(
                PalSurf[3], // Reticle_Surface[3],
                0,
                &rect,
                p_array,
                D3D_FORMAT,
                (16*SceneSize), // 16 for 4x32-bit, 8 for 2x32-bit, 4 for 1x32F format
                0,
                &srcRect,
                D3DX_FILTER_NONE,
                0);
            if (FAILED(hr))
                return false;
        }

        if (p_index == 99) {
            for (int i = 0; i < 4; i++) {
                Device->UpdateTexture(PalTex[i], Reticle_Tex[i]);
                Release<IDirect3DSurface9*>(PalSurf[i]);
                Release<IDirect3DTexture9*>(PalTex[i]);
            }
        }
    }

    return true;
} // uploadReticle()
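
In the palette branch of uploadReticle(), a single reticle can be copied into more than one of the four overlapping palettes. The four range tests reduce to one rule: palette k starts at reticle 25*k, taken modulo 100, and holds 64 consecutive reticles. The sketch below restates that rule on the CPU. PaletteDest and paletteDestinations are hypothetical names, and the sketch assumes indices 0 to 99.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record of one destination: which palette surface receives the
// reticle and the top-left texel of its destination rectangle.
struct PaletteDest {
    int palette; // palette surface index (0..3)
    int left;    // SceneSize * (slot % 8)
    int top;     // SceneSize * (slot / 8)
};

std::vector<PaletteDest> paletteDestinations(int index, int sceneSize) {
    assert(index >= 0 && index < 100); // assumed index range
    // Slot of 'index' inside each palette, matching the four tests in the listing.
    const int slots[4] = { index, index - 25, (index + 50) % 100, (index + 25) % 100 };
    std::vector<PaletteDest> dests;
    for (int k = 0; k < 4; ++k) {
        int s = slots[k];
        if (s >= 0 && s <= 63) { // the reticle falls inside this 8x8 palette
            dests.push_back({k, sceneSize * (s % 8), sceneSize * (s / 8)});
        }
    }
    return dests;
}
```

For example, reticle 30 with a 128-texel scene lands in palettes 0, 1, and 3 (slots 30, 5, and 55), while reticle 99 lands only in palettes 2 and 3.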


int GetAlg() {
    if (SceneSize == 256)
        return 2; // one by one
    else
        return 1; // big tex
}

//
// ~Gpu()
~Gpu() {
    Cleanup();
} // ~Gpu() DESTRUCTOR

};

#endif // GPU_CLASS_H_BY_MAJ_JEFFERS
//
// file: vs_bigtex.txt
// BY MAJ SEAN JEFFERS
//
// 11 nov 04 -- multiplies 1x1 scene by 8x8 reticle palette, then does
//              4:1 redux; results in RT that is quarter size of palette;
//              sampling of scene done w/ wrapping
//              PixelSize.x = 1/ret palette width
//              PixelSize.y = 1/scene width
//              PixelSize.z = 0.0f (must!)
//
// -- vs generates 8 tex coords for PS

uniform float4 PixelSize;

// structures
struct VS_INPUT
{
    float4 Pos : POSITION;
};

struct VS_OUTPUT
{
    float4 Pos  : POSITION;
    float2 Tex  : TEXCOORD0;
    float2 Tex1 : TEXCOORD1;
    float2 Tex2 : TEXCOORD2;
    float2 Tex3 : TEXCOORD3;
    float2 Tex4 : TEXCOORD4;
    float2 Tex5 : TEXCOORD5;
    float2 Tex6 : TEXCOORD6;
    float2 Tex7 : TEXCOORD7;
};

// vertex shader function (input channels)
VS_OUTPUT Main(VS_INPUT input)
{
    VS_OUTPUT output = (VS_OUTPUT)0;

    output.Pos.xy = input.Pos.xy; // + PixelSize.xy;
    output.Pos.z = 0.5f; // value missing from the source listing; 0.5f assumed from the other vertex shaders
    output.Pos.w = 1.0f;

    output.Tex  = float2(0.5f, -0.5f) * input.Pos.xy + 0.5f.xx;
    output.Tex1 = output.Tex + PixelSize.xz;
    output.Tex2 = output.Tex + PixelSize.zx;
    output.Tex3 = output.Tex + PixelSize.xx;

    output.Tex4 = 8.0f * output.Tex;
    output.Tex5 = output.Tex4 + PixelSize.yz;
    output.Tex6 = output.Tex4 + PixelSize.zy;
    output.Tex7 = output.Tex4 + PixelSize.yy;
    return output;
}
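
The texture-coordinate arithmetic in vs_bigtex.txt does two things: it maps the quad's clip-space position to a palette coordinate in [0, 1] with the y axis flipped, and it scales that coordinate by 8 so that, with WRAP addressing, the small scene texture repeats once per palette cell. The following CPU sketch illustrates both steps. Float2, clipToTex, and sceneCoordWrapped are hypothetical names, and the wrap is modeled simply as the fractional part of the coordinate.

```cpp
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };

// Clip-space position (x, y in -1..1, y up) to texture coordinate
// (u, v in 0..1, v down): Tex = float2(0.5, -0.5) * Pos.xy + 0.5.
Float2 clipToTex(Float2 pos) {
    return Float2{0.5f * pos.x + 0.5f, -0.5f * pos.y + 0.5f};
}

// Scene coordinate for a given palette coordinate: scale by 8, then keep the
// fractional part (an assumed model of WRAP texture addressing).
Float2 sceneCoordWrapped(Float2 paletteTex) {
    float u = 8.0f * paletteTex.x;
    float v = 8.0f * paletteTex.y;
    return Float2{u - std::floor(u), v - std::floor(v)};
}
```

A palette coordinate of (0.1875, 0.5) lies in the middle of the second cell of its row, and the wrapped scene coordinate is accordingly (0.5, 0.0).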
// file: ps_bigtex.txt
// depends on file: vs_bigtex.txt
// BY MAJ SEAN JEFFERS
//
// 11 nov 04 -- multiplies 1x1 scene by 8x8 reticle palette, then does
//              4:1 redux; results in RT that is quarter size of palette;
//              sampling of small texture done w/ wrapping
// 27 dec 04 -- modified to have AGRB32 in and R32F out with dot product

// globals
sampler Rendersampler;  // 8x8 reticle palette (big texture)
sampler Rendersampler1; // scene 1x1 small texture

// structures
struct PS_INPUT
{
    float2 Tex  : TEXCOORD0;
    float2 Tex1 : TEXCOORD1;
    float2 Tex2 : TEXCOORD2;
    float2 Tex3 : TEXCOORD3;
    float2 Tex4 : TEXCOORD4;
    float2 Tex5 : TEXCOORD5;
    float2 Tex6 : TEXCOORD6;
    float2 Tex7 : TEXCOORD7;
};

struct PS_OUTPUT
{
    float4 clr : COLOR;
};

// Pixel Shader (input channels): output channel
PS_OUTPUT PSMain(PS_INPUT input)
{
    PS_OUTPUT output = (PS_OUTPUT)0;

    float4 a = tex2D(Rendersampler, input.Tex);
    float4 b = tex2D(Rendersampler, input.Tex1);
    float4 c = tex2D(Rendersampler, input.Tex2);
    float4 d = tex2D(Rendersampler, input.Tex3);

    float4 e = tex2D(Rendersampler1, input.Tex4);
    float4 f = tex2D(Rendersampler1, input.Tex5);
    float4 g = tex2D(Rendersampler1, input.Tex6);
    float4 h = tex2D(Rendersampler1, input.Tex7);

    output.clr = dot(a,e) + dot(b,f) + dot(c,g) + dot(d,h);
    // a*e + b*f + c*g + d*h;

    return output;
}
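
A CPU reference can help when checking the output of the multiply-and-reduce pass. The sketch below is a simplified single-channel version: each output texel is the sum of reticle times scene over one 2x2 block, which is what the four dot products compute when every texel carries a single value. The shader itself works on float4 texels, so this is an approximation of the data layout rather than a replacement for it. The name maddReduce4to1 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: multiplies two n x n single-channel images
// element-wise and sums each 2x2 block into one output value (4:1 reduction).
std::vector<float> maddReduce4to1(const std::vector<float>& reticle,
                                  const std::vector<float>& scene, int n) {
    assert(n > 0 && n % 2 == 0);
    assert(reticle.size() == static_cast<std::size_t>(n) * n);
    assert(scene.size() == reticle.size());
    const int half = n / 2;
    std::vector<float> out(static_cast<std::size_t>(half) * half, 0.0f);
    for (int y = 0; y < half; ++y) {
        for (int x = 0; x < half; ++x) {
            float sum = 0.0f;
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    int idx = (2 * y + dy) * n + (2 * x + dx);
                    sum += reticle[idx] * scene[idx];
                }
            }
            out[y * half + x] = sum;
        }
    }
    return out;
}
```

Summing every element of the returned image gives the full dot product of the reticle and the scene, which is the quantity the later reduction passes accumulate.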
// file: vs_onebyone.txt (was vs_experimental.txt)
// BY MAJ SEAN JEFFERS
// used by: GPU_combined.h and GPU_CLASS_ONEBYONE_R32F.h
//
// 17 oct 04 -- use PixelSize.y for Tex2-4 components instead of .x
// 27 dec 04 -- renamed to vs_onebyone.txt

// variables that are provided by the application
uniform float4 PixelSize;

// structures
struct VS_INPUT
{
    float4 Pos : POSITION;
};

struct VS_OUTPUT
{
    float4 Pos  : POSITION;
    float2 Tex  : TEXCOORD0;
    float2 Tex2 : TEXCOORD1;
    float2 Tex3 : TEXCOORD2;
    float2 Tex4 : TEXCOORD3;
};

// vertex shader function (input channels)
VS_OUTPUT Main(VS_INPUT input)
{
    VS_OUTPUT output = (VS_OUTPUT)0;

    output.Pos.xy = input.Pos.xy + PixelSize.xy;
    output.Pos.z = 0.5f;
    output.Pos.w = 1.0f;

    float2 Tex = float2(0.5f, -0.5f) * input.Pos.xy + 0.5f.xx;

    output.Tex  = Tex;
    output.Tex2 = Tex + float2(PixelSize.y, 0.0f); // use .y instead of .x
    output.Tex3 = Tex + float2(0.0f, PixelSize.y);
    output.Tex4 = Tex + float2(PixelSize.y, PixelSize.y);

    return output;
}
// file: ps_onebyone.txt (was ps_maddreduce_new.txt)
// PS for mult, add, reduce, 4:1; 1 scene tex madd with 6 reticles,
// then result of last madd added in
// BY MAJ SEAN JEFFERS
//
// 1 oct 04  -- 1st ver. had 2 samplers, v2 had 8
// 2 oct 04  -- modified for t1 * sum(t2-t7) + t8, where t8 result of
//              last pass; eliminates need for a 3rd PS/VS
// 5 oct 04  -- changed to add only single pixel from t8 (previous result) texture
//              because it is already a smaller, reduced texture
// 14 oct 04 -- removed output struct
// 17 oct 04 -- changed data type to float4 instead of vector
//
// new file name: ps_maddreduce_new.txt
// 27 oct 04 -- this new version has only 2 samplers and no addback of
//              previous results
// 27 dec    -- changed to "dot" to accommodate R32F

// globals
sampler Rendersampler;
sampler Rendersampler2;

// structures
struct PS_INPUT
{
    float2 Tex  : TEXCOORD0;
    float2 Tex2 : TEXCOORD1;
    float2 Tex3 : TEXCOORD2;
    float2 Tex4 : TEXCOORD3;
};

// struct PS_OUTPUT
// {
//     float4 clr : COLOR; // was COLOR0
// };

// Pixel Shader (input channels): output channel
float4 PS_Main(PS_INPUT input) : COLOR
{
    float4 t1a = tex2D(Rendersampler, input.Tex); // float4
    float4 t1b = tex2D(Rendersampler, input.Tex2);
    float4 t1c = tex2D(Rendersampler, input.Tex3);
    float4 t1d = tex2D(Rendersampler, input.Tex4);

    float4 t2a = tex2D(Rendersampler2, input.Tex);
    float4 t2b = tex2D(Rendersampler2, input.Tex2);
    float4 t2c = tex2D(Rendersampler2, input.Tex3);
    float4 t2d = tex2D(Rendersampler2, input.Tex4);

    // madd src (t1) & one reticle (t2)
    float4 p1 = dot(t1a, t2a); // t1a*t2a; was float4
    float4 p2 = dot(t1b, t2b); // t1b*t2b;
    float4 p3 = dot(t1c, t2c); // t1c*t2c;
    float4 p4 = dot(t1d, t2d); // t1d*t2d;

    // new
    return p1 + p2 + p3 + p4;
}
//
// vs_16tapredux_2.txt
// BY MAJ SEAN JEFFERS
//
// 1 oct 04  -- modified to output 4 texcoords for block-of-4 reduction op
//              note x-displacement is negated
// 3 oct 04  -- trying original approach to see if reduce error
// 17 oct 04 -- changed offset array size to [8] from [16]
//           -- removed redundant PixelSize

// variables that are provided by the application
// float4 PixelSize;
float2 offset[8]; // was 16

// structures
struct VS_INPUT
{
    float4 Pos : POSITION;
};

struct VS_OUTPUT
{
    float4 Pos  : POSITION;
    float2 Tex0 : TEXCOORD0;
    float2 Tex1 : TEXCOORD1;
    float2 Tex2 : TEXCOORD2;
    float2 Tex3 : TEXCOORD3;
    float2 Tex4 : TEXCOORD4;
    float2 Tex5 : TEXCOORD5;
    float2 Tex6 : TEXCOORD6;
    float2 Tex7 : TEXCOORD7;
};

// vertex shader function (input channels)
VS_OUTPUT Main(VS_INPUT input)
{
    VS_OUTPUT output = (VS_OUTPUT)0;

    output.Pos.xy = input.Pos.xy; // + float2(-offset[1].x, offset[1].x); // PixelSize.xy;
    output.Pos.z = 0.5f;
    output.Pos.w = 1.0f;

    float2 Tex = float2(0.5f, -0.5f) * input.Pos.xy + 0.5f.xx;
    // Tex *= float2(1.0f, 0.5f); // added float to test subset 12 nov

    output.Tex0 = Tex;
    output.Tex1 = Tex + offset[1];
    output.Tex2 = Tex + offset[2];
    output.Tex3 = Tex + offset[3];
    output.Tex4 = Tex + offset[4];
    output.Tex5 = Tex + offset[5];
    output.Tex6 = Tex + offset[6];
    output.Tex7 = Tex + offset[7];

    return output;
}
// PS 16:1 reduce
//
// file: ps_16tapredux_2.txt
// BY MAJ SEAN JEFFERS
//
// 3 oct 04  -- trying original approach to see if reduce error
// 17 oct 04 -- change offset array size to [8] from [16]

// globals
uniform float2 offset[8];

sampler Rendersampler;

// structures
struct PS_INPUT
{
    float2 Tex0 : TEXCOORD0;
    float2 Tex1 : TEXCOORD1;
    float2 Tex2 : TEXCOORD2;
    float2 Tex3 : TEXCOORD3;
    float2 Tex4 : TEXCOORD4;
    float2 Tex5 : TEXCOORD5;
    float2 Tex6 : TEXCOORD6;
    float2 Tex7 : TEXCOORD7;
};

// Pixel Shader (input channels): output channel
float4 PS_Main(PS_INPUT input) : COLOR0
{
    // PS_OUTPUT output = (PS_OUTPUT)0;

    float4 ColorSum = 0.0f;

    // sample first 8 taps (first 2 rows of 4x4 block)
    float4 c0 = tex2D(Rendersampler, input.Tex0);
    float4 c1 = tex2D(Rendersampler, input.Tex1);
    float4 c2 = tex2D(Rendersampler, input.Tex2);
    float4 c3 = tex2D(Rendersampler, input.Tex3);
    float4 c4 = tex2D(Rendersampler, input.Tex4);
    float4 c5 = tex2D(Rendersampler, input.Tex5);
    float4 c6 = tex2D(Rendersampler, input.Tex6);
    float4 c7 = tex2D(Rendersampler, input.Tex7);

    // add color values of first 8 taps
    ColorSum += c0;
    ColorSum += c1;
    ColorSum += c2;
    ColorSum += c3;
    ColorSum += c4;
    ColorSum += c5;
    ColorSum += c6;
    ColorSum += c7;

    // calculate texcoords for remaining 8 taps
    float2 Tap8  = input.Tex0 + offset[0]; // was 8-15
    float2 Tap9  = input.Tex0 + offset[1];
    float2 Tap10 = input.Tex0 + offset[2];
    float2 Tap11 = input.Tex0 + offset[3];
    float2 Tap12 = input.Tex0 + offset[4];
    float2 Tap13 = input.Tex0 + offset[5];
    float2 Tap14 = input.Tex0 + offset[6];
    float2 Tap15 = input.Tex0 + offset[7];

    // sample remaining 8 taps
    c0 = tex2D(Rendersampler, Tap8);
    c1 = tex2D(Rendersampler, Tap9);
    c2 = tex2D(Rendersampler, Tap10);
    c3 = tex2D(Rendersampler, Tap11);
    c4 = tex2D(Rendersampler, Tap12);
    c5 = tex2D(Rendersampler, Tap13);
    c6 = tex2D(Rendersampler, Tap14);
    c7 = tex2D(Rendersampler, Tap15);

    // add last 8 taps to sum
    ColorSum += c0;
    ColorSum += c1;
    ColorSum += c2;
    ColorSum += c3;
    ColorSum += c4;
    ColorSum += c5;
    ColorSum += c6;
    ColorSum += c7;

    return ColorSum;
}
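
The 16-tap shader sums sixteen samples per output texel. Assuming the sixteen taps cover one non-overlapping 4x4 block of the input, as the shader's own comments suggest (the offset values are set by the application and are not shown in this listing), one pass is equivalent to the CPU reduction sketched below for a single-channel image. The name reduce16to1 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one 16:1 reduction pass: each output value
// is the sum of one 4x4 block of the n x n input (assumed tap layout).
std::vector<float> reduce16to1(const std::vector<float>& in, int n) {
    assert(n > 0 && n % 4 == 0);
    assert(in.size() == static_cast<std::size_t>(n) * n);
    const int m = n / 4;
    std::vector<float> out(static_cast<std::size_t>(m) * m, 0.0f);
    for (int y = 0; y < m; ++y) {
        for (int x = 0; x < m; ++x) {
            float sum = 0.0f;
            for (int dy = 0; dy < 4; ++dy) {
                for (int dx = 0; dx < 4; ++dx) {
                    sum += in[(4 * y + dy) * n + (4 * x + dx)];
                }
            }
            out[y * m + x] = sum;
        }
    }
    return out;
}
```

Applying the function repeatedly shrinks the image by a factor of four per side on each pass while preserving the total sum.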
// win_CONSCAN.cpp : Defines the entry point for the application.
// BY MAJ SEAN JEFFERS -- tests conscan gpu code

#include "stdafx.h"
#include "win_CONSCAN.h"
#define MAX_LOADSTRING 100
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>

#include "GPU_CONSCAN.h" // combined.h or CLASS_ONEBYONE.h CLASS_ONEBYONE_R32F.h

// Global Variables:
HINSTANCE hInst;                     // current instance
TCHAR szTitle[MAX_LOADSTRING];       // The title bar text
TCHAR szWindowClass[MAX_LOADSTRING]; // the main window class name

const int EXPER = 100;
const int SCENE_SIZE = 0;
const int SCENE_SIZE_X = 512;
const int SCENE_SIZE_Y = 512;
const int RETICLE_SIZE = 128;
const int WL = 1;
const char* BUS_str = "PCI-e";
const char* APRCH_str = "ATI";
const int REPS = 2;

const int SIZE_SQ = SCENE_SIZE_X*SCENE_SIZE_Y;
const int RET_SIZE_SQ = RETICLE_SIZE*RETICLE_SIZE;

// WL 3 pt source vars
const int xmin = SCENE_SIZE_X/4;
const int ymin = SCENE_SIZE_Y/4;
const int xmax = SCENE_SIZE-xmin;
const int ymax = SCENE_SIZE-ymin;

// initial conditions
int oldx = xmin;
int oldy = ymin;
int delx = -1;
int xinc = -1;
int dely = 0;
int yinc = -1;

// Forward declarations of functions included in this code module:
void UpdateScene(int, float*);
ATOM MyRegisterClass(HINSTANCE hInstance);
BOOL InitInstance(HINSTANCE, int);
LRESULT CALLBACK WndProc(HWND, UINT, WPARAM, LPARAM);
LRESULT CALLBACK About(HWND, UINT, WPARAM, LPARAM);


int APIENTRY _tWinMain(HINSTANCE hInstance,
                       HINSTANCE hPrevInstance,
                       LPTSTR lpCmdLine,
                       int nCmdShow)
{
    float reticle[RET_SIZE_SQ];
    float scene[SIZE_SQ];
    double answer[40];

    int retIndex[40];
    int xdisp[40];
    int ydisp[40];
    double times[REPS];
    double timeslog[REPS];
    double startTime;
    double endTime;
    // The initial values of the statistics variables are missing from the
    // source listing; 0.0 is assumed here.
    double sum = 0.0;
    double mean = 0.0;
    double var = 0.0;
    double stdev = 0.0;
    double hi = 0.0;
    double low = 0.0;
    double sos = 0.0;
    double sumlog = 0.0;
    double meanlog = 0.0;
    double varlog = 0.0;
    double stdevlog = 0.0;
    double soslog = 0.0;
    double hilog = 0.0;
    double lowlog = 0.0;
    char WL_str[30];

    if (WL == 1) {
        strcpy(WL_str, "1 - non-changing");
    }
    else if (WL == 2) {
        strcpy(WL_str, "2 - fully-changing");
    }
    else {
        strcpy(WL_str, "3 - moving pt source");
    }

    // instantiate GPUConscan object
    GpuConscan gpu(hInstance, nCmdShow, SCENE_SIZE_X, SCENE_SIZE_Y,
                   RETICLE_SIZE);

    // upload reticles
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < RET_SIZE_SQ; j++) {
            reticle[j] = (float)i;
        }
        gpu.uploadReticle(i, reticle);
    }

    // initialize retIndex and x/y disp arrays
    for (int i = 0; i < 40; i++) {
        retIndex[i] = 39-i;
        xdisp[i] = 1;
        ydisp[i] = 1;
    }

    // The header of the repetition loop is missing from the source listing;
    // it is assumed here from the use of times[rep] and REPS below.
    for (int rep = 0; rep < REPS; rep++) {
        startTime = (double)timeGetTime();

        // run algorithm 1000x
        for (int i = 0; i < 1000; i++) {
            gpu.Process(retIndex, scene, answer, xdisp, ydisp);
            UpdateScene(WL, scene);
        }

        endTime = (double)timeGetTime();
        double timeDelta = (endTime-startTime)*0.001f;
        double timeDeltaLog = log10(timeDelta);
        times[rep] = timeDelta;
        timeslog[rep] = timeDeltaLog;
        sum += timeDelta;
        sumlog += timeDeltaLog;
    }

    // calc stats
    mean = sum/(double)REPS;
    meanlog = sumlog/(double)REPS;
    hi = 0.0;
    hilog = -1000.0;
    low = 1000.0;
    lowlog = 1000.0;
    for (int i = 0; i < REPS; i++) {
        if (times[i] > hi)
            hi = times[i];
        if (times[i] < low)
            low = times[i];
        var += pow((times[i]-mean), 2.0)/(double)(REPS-1);
        sos += pow(times[i], 2);
        if (timeslog[i] > hilog)
            hilog = timeslog[i];
        if (timeslog[i] < lowlog)
            lowlog = timeslog[i];
        varlog += pow((timeslog[i]-meanlog), 2.0)/(double)(REPS-1);
        soslog += pow(timeslog[i], 2);
    }
    stdev = sqrt(var);
    stdevlog = sqrt(varlog);

    // write results to file
    char* name = "results/results_";
    char* ext = ".dat";
    char num[4];
    itoa(EXPER, num, 10);
    char filename[40];
    strcpy(filename, name);
    strcat(filename, num);
    strcat(filename, ext);

    std::ofstream outFile(filename, std::ios::app); // out
    if (!outFile) {
        ::MessageBox(0, "can't open results file", "GPU", 0);
        exit(1);
    }

    outFile << "experiment #: " << EXPER << '\n'
            << "workload: " << WL_str << '\n'
            << "bus: " << BUS_str << '\n'
            << "approach: " << APRCH_str << '\n'
            << "algorithm: " << algorithm << '\n'
            << "size: " << SCENE_SIZE << '\n'
            << "mean: " << mean << '\n'
            << "variance: " << var << '\n'
            << "stdev: " << stdev << '\n'
            << "sum of sqrs: " << sos << '\n'
            << "low: " << low << '\n'
            << "hi: " << hi << '\n'
            << "mean log: " << meanlog << '\n'
            << "variance log: " << varlog << '\n'
            << "stdev log: " << stdevlog << '\n'
            << "sum of sqrs log: " << soslog << '\n'
            << "low log: " << lowlog << '\n'
            << "hi log: " << hilog << '\n'
            << "reps: " << REPS << '\n'
            << "data: " << '\n';

    for (int i = 0; i < REPS; i++) {
        outFile << times[i];
        if (!((i+1)%5) || (i == REPS-1))
            outFile << '\t' << "\n";
        else outFile << '\t';
    }

    outFile << '\n' << "data log: " << '\n';
    for (int i = 0; i < REPS; i++) {
        outFile << std::setprecision(6) << std::setw(3) << timeslog[i];
        if (!((i+1)%5) || (i == REPS-1))
            outFile << '\t' << "\n";
        else outFile << '\t';
    }

    outFile << '\n' << "answers: " << '\n';
    for (int i = 0; i < 40; i++) {
        outFile << std::setprecision(9) << std::setw(18) <<
            std::setiosflags(std::ios::scientific) << answer[i];
        if (!((i+1)%4))
            outFile << '\n';
    }
    outFile << '\n';

    MSG msg;
    HACCEL hAccelTable;

    // Initialize global strings
    LoadString(hInstance, IDS_APP_TITLE, szTitle, MAX_LOADSTRING);
    LoadString(hInstance, IDC_WIN_CONSCAN, szWindowClass, MAX_LOADSTRING);
    MyRegisterClass(hInstance);

    // Perform application initialization:
    if (!InitInstance(hInstance, nCmdShow))
    {
        return FALSE;
    }

    hAccelTable = LoadAccelerators(hInstance, (LPCTSTR)IDC_WIN_CONSCAN);

    // Main message loop:
    while (GetMessage(&msg, NULL, 0, 0))
    {
        if (!TranslateAccelerator(msg.hwnd, hAccelTable, &msg))
        {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
    }

    return (int) msg.wParam;
}


// FUNCTION: UpdateScene()

void UpdateScene(int p_WL, float* p_scene) {
    if (p_WL == 1) {
        return;
    }
    if (p_WL == 2) {
        for (int j = 0; j < SIZE_SQ; j++) {
            p_scene[j] += 1.0f;
        }
        return;
    }
    else {
        int x = delx + oldx;
        int y = dely + oldy;
        if ((x < xmin) || (x > xmax)) {
            x = oldx;
            xinc = -xinc;
            delx = delx + xinc;
            dely = dely + yinc;
            y += dely;
        }
        if ((y < ymin) || (y > ymax)) {
            y = oldy;
            yinc = -yinc;
            delx += xinc;
            dely += yinc;
            x += delx;
        }

        int index = y * SCENE_SIZE_X + x;
        int indexold = oldy * SCENE_SIZE_X + oldx;
        p_scene[indexold] = 0.0f;
        p_scene[index] = 1.0f;
        oldx = x;
        oldy = y;
        return;
    }
}
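
The last branch of UpdateScene() moves a single lit pixel through the scene and reverses an increment whenever the pixel leaves the allowed range. The sketch below restates that step in a standalone form. It assumes the globals used by the listing (oldx, oldy, delx, dely, xinc, yinc and the four limits) can be gathered into one struct; the names TargetState and StepTarget are hypothetical, and the bounds check before the write is an added safeguard that the listing does not contain.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bundle of the globals that UpdateScene() reads and writes.
struct TargetState {
    int oldx, oldy;              // pixel that is currently lit
    int delx, dely;              // displacement applied on each step
    int xinc, yinc;              // amount added to the displacement on a bounce
    int xmin, xmax, ymin, ymax;  // allowed range for the target
};

// Advances the target by one step, following the last branch of the listing.
// Returns false and leaves the scene buffer untouched if the new position
// falls outside the buffer (an assumption added here for safety).
bool StepTarget(TargetState& s, std::vector<float>& scene, int sizeX, int sizeY) {
    assert(scene.size() == static_cast<std::size_t>(sizeX) * sizeY);
    assert(s.oldx >= 0 && s.oldx < sizeX && s.oldy >= 0 && s.oldy < sizeY);

    int x = s.delx + s.oldx;
    int y = s.dely + s.oldy;
    if (x < s.xmin || x > s.xmax) {   // left the range in x: reverse xinc
        x = s.oldx;
        s.xinc = -s.xinc;
        s.delx += s.xinc;
        s.dely += s.yinc;
        y += s.dely;
    }
    if (y < s.ymin || y > s.ymax) {   // left the range in y: reverse yinc
        y = s.oldy;
        s.yinc = -s.yinc;
        s.delx += s.xinc;
        s.dely += s.yinc;
        x += s.delx;
    }
    if (x < 0 || x >= sizeX || y < 0 || y >= sizeY)
        return false;

    scene[s.oldy * sizeX + s.oldx] = 0.0f;  // clear the old pixel
    scene[y * sizeX + x] = 1.0f;            // light the new pixel
    s.oldx = x;
    s.oldy = y;
    return true;
}
```

Passing the state explicitly keeps the step testable without a window or a device, which the global-variable form in the listing does not allow.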


//
// FUNCTION: MyRegisterClass()
//
// PURPOSE: Registers the window class.
//
// COMMENTS:
//
//    This function and its usage are only necessary if you want this code
//    to be compatible with Win32 systems prior to the 'RegisterClassEx'
//    function that was added to Windows 95. It is important to call this function
//    so that the application will get 'well formed' small icons associated
//    with it.
//
ATOM MyRegisterClass(HINSTANCE hInstance)
{
    WNDCLASSEX wcex;

    wcex.cbSize = sizeof(WNDCLASSEX);

    wcex.style         = CS_HREDRAW | CS_VREDRAW;
    wcex.lpfnWndProc   = (WNDPROC)WndProc;
    wcex.cbClsExtra    = 0;
    wcex.cbWndExtra    = 0;
    wcex.hInstance     = hInstance;
    wcex.hIcon         = LoadIcon(hInstance, (LPCTSTR)IDI_WIN_CONSCAN);
    wcex.hCursor       = LoadCursor(NULL, IDC_ARROW);
    wcex.hbrBackground = (HBRUSH)(COLOR_WINDOW+1);
    wcex.lpszMenuName  = (LPCTSTR)IDC_WIN_CONSCAN;
    wcex.lpszClassName = szWindowClass;
    wcex.hIconSm       = LoadIcon(wcex.hInstance, (LPCTSTR)IDI_SMALL);

    return RegisterClassEx(&wcex);
}

//
// FUNCTION: InitInstance(HANDLE, int)
//
// PURPOSE: Saves instance handle and creates main window
//
// COMMENTS:
//
//    In this function, we save the instance handle in a global variable and
//    create and display the main program window.
//
BOOL InitInstance(HINSTANCE hInstance, int nCmdShow)
{
    HWND hWnd;

    hInst = hInstance; // Store instance handle in our global variable

    hWnd = CreateWindow(szWindowClass, szTitle, WS_OVERLAPPEDWINDOW,
        CW_USEDEFAULT, 0, CW_USEDEFAULT, 0, NULL, NULL, hInstance, NULL);

    if (!hWnd)
    {
        return FALSE;
    }

    ShowWindow(hWnd, nCmdShow);
    UpdateWindow(hWnd);

    return TRUE;
}


//
// FUNCTION: WndProc(HWND, unsigned, WORD, LONG)
//
// PURPOSE: Processes messages for the main window.
//
// WM_COMMAND - process the application menu
// WM_PAINT   - Paint the main window
// WM_DESTROY - post a quit message and return
//
LRESULT CALLBACK WndProc(HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam)
{
    int wmId, wmEvent;
    PAINTSTRUCT ps;
    HDC hdc;

    switch (message)
    {
    case WM_COMMAND:
        wmId    = LOWORD(wParam);
        wmEvent = HIWORD(wParam);
        // Parse the menu selections:
        switch (wmId)
        {
        case IDM_ABOUT:
            DialogBox(hInst, (LPCTSTR)IDD_ABOUTBOX, hWnd, (DLGPROC)About);
            break;
        case IDM_EXIT:
            DestroyWindow(hWnd);
            break;
        default:
            return DefWindowProc(hWnd, message, wParam, lParam);
        }
        break;
    case WM_PAINT:
        hdc = BeginPaint(hWnd, &ps);
        // TODO: Add any drawing code here...
        EndPaint(hWnd, &ps);
        break;
    case WM_DESTROY:
        PostQuitMessage(0);
        break;
    default:
        return DefWindowProc(hWnd, message, wParam, lParam);
    }
    return 0;
}

// Message handler for about box.
LRESULT CALLBACK About(HWND hDlg, UINT message, WPARAM wParam, LPARAM lParam)
{
    switch (message)
    {
    case WM_INITDIALOG:
        return TRUE;

    case WM_COMMAND:
        if (LOWORD(wParam) == IDOK || LOWORD(wParam) == IDCANCEL)
        {
            EndDialog(hDlg, LOWORD(wParam));
            return TRUE;
        }
        break;
    }
    return FALSE;
}

// file: GPU_CONSCAN.h
//
// by: Maj Sean Jeffers
// requires external files:
//    GPU_UTILITY.h               -- contains namespace d3d utility functions InitD3D(),
//                                   GPU WndProc CALLBACK and Gpu WndClass definitions
//    source/ps_CONSCAN.txt       -- PS used by MAddReduce()
//    source/vs_CONSCAN.txt       -- VS used by MAddReduce()
//    source/vs_16tapredux_2.txt  -- vertex shader used by Redux()
//    source/ps_16tapredux_2.txt  -- pixel shader used by Redux()
//
// 27 dec 04 -- modified old ONEBYONE to use non-packed R32F textures throughout
//           -- this is expected to be basis for CONSCAN
// 2 jan 05  -- renamed R32F to GPU_CONSCAN.h; changed to use CONSCAN vs and ps
//           -- border color not supported; clamping is

#ifndef GPU_CLASS_H_BY_MAJ_JEFFERS
#define GPU_CLASS_H_BY_MAJ_JEFFERS

#include <d3dx9.h>
#include "GPU_UTILITY.h"
#include <stdlib.h>
#include <cstring>
#include <cmath>

// CONSTANTS
#define GPU_WINDOW_WIDTH  1024
#define GPU_WINDOW_HEIGHT 768
#define D3D_FORMAT D3DFMT_R32F // A32B32G32R32F
#define STRIDE 4 // 16


class GpuConscan {
private:
    HINSTANCE           hInst;
    int                 nCmdShow;
    IDirect3DDevice9*   Device;
    // const int        SceneSize;
    const int           SceneSizeX;
    const int           SceneSizeY;
    const int           ReticleSize;
    // int              ScenePixels;
    //float             fPixSizeX;
    //float             fPixSizeY;
    // D3DXVECTOR4      DataArray[2048*2048];
    int                 OutTexSize;
    long                OutPixels;
    // int              ViewportSize;
    int                 ReduceIterations;
    int                 TexIndex;
    bool                DualRT;

    // VS1
    IDirect3DVertexShader9* VS1_maddreduce;
    ID3DXConstantTable*     VS1_VSCT;
    D3DXHANDLE              VS1_PixelSizeHandle;
    D3DXHANDLE              VS1_DisplacementHandle;
    D3DXHANDLE              VS1_AspectHandle;

    // PS1
    IDirect3DPixelShader9*  PS1_maddreduce;
    ID3DXConstantTable*     PS1_PSCT;

    // VS2
    IDirect3DVertexShader9* VS2_16tapreduce;
    ID3DXConstantTable*     VS2_VSCT;
    D3DXHANDLE              VS2_offsetHandle;

    // PS2
    IDirect3DPixelShader9*  PS2_16tapreduce;
    ID3DXConstantTable*     PS2_PSCT;
    D3DXHANDLE              PS2_offsetHandle;
    D3DXHANDLE              PS2_mulHandle;

    // PARAMETERS PASSED TO PS & VS
    D3DXVECTOR2 offset[4][16];

    // VERTEX BUFFER & DECL
    LPDIRECT3DVERTEXDECLARATION9 m_pDecl;
    IDirect3DVertexBuffer9*      QuadVB;

    // TEXTURES & SURFACES
    IDirect3DTexture9*  Scene_Tex;
    IDirect3DSurface9*  Scene_Surface;

    IDirect3DTexture9*  Reticle_Tex[100];
    IDirect3DSurface9*  Reticle_Surface[100];

    IDirect3DTexture9*  RT_Tex;
    IDirect3DSurface9*  RT_Surface;

    IDirect3DTexture9*  RT_Reduce_Tex[4];
    IDirect3DSurface9*  RT_Reduce_Surface[4];
    IDirect3DTexture9*  RT_Reduce_Tex2[3];
    IDirect3DSurface9*  RT_Reduce_Surface2[3];
    IDirect3DTexture9*  Ones_Tex;
    IDirect3DSurface9*  Ones_Surface;

    // transformation matrices
    D3DXMATRIX mWorld;
    D3DXMATRIX mView;
    D3DXMATRIX mProj;

    // STRUCTS
    struct CUSTOMVERTEX
    {
        // position
        FLOAT x;
        FLOAT y;
    };


public:
    //constructor
    GpuConscan(HINSTANCE p_hInst, int p_nCmdShow, int p_SceneSizeX,
               int p_SceneSizeY, int p_ReticleSize)
        : hInst(p_hInst), nCmdShow(p_nCmdShow), SceneSizeX(p_SceneSizeX),
          SceneSizeY(p_SceneSizeY), ReticleSize(p_ReticleSize) // 1 2
    {
        Device = 0;

        //fPixSizeX = 1.0f / (float)SceneSize;
        //fPixSizeY = 1.0f / (float)SceneSize;

        // VS1
        VS1_maddreduce = 0;
        VS1_VSCT = 0;
        VS1_PixelSizeHandle = 0;

        // PS1
        PS1_maddreduce = 0;
        PS1_PSCT = 0;

        // VS2
        VS2_16tapreduce = 0;
        VS2_VSCT = 0;
        VS2_offsetHandle = 0;

        // PS2
        PS2_16tapreduce = 0;
        PS2_PSCT = 0;
        PS2_offsetHandle = 0;

        // vertex buffer ptr
        QuadVB = 0;

        if (!d3d::InitD3D(hInst, nCmdShow,
                GPU_WINDOW_WIDTH, GPU_WINDOW_HEIGHT,
                true, D3DDEVTYPE_HAL, &Device))
        {
            ::MessageBox(0, "InitD3D() - FAILED", 0, 0);
        }

        if (!Setup()) {
            ::MessageBox(0, "Setup() - FAILED", 0, 0);
        }

        DualRT = false;

        // modular actions depending on algorithm
        InitShaders();
        InitReticlesAndScene_OneByOne();
        InitRenderTargets_hybrid();
    }// Gpu() CONSTRUCTOR


private:

    // InitShaders()
    // creates & compiles shaders

    bool InitShaders() {
        HRESULT hr = 0;

        // *** PS1
        ID3DXBuffer* PSBuffer = 0;
        ID3DXBuffer* errorBuffer = 0;

        hr = D3DXCompileShaderFromFile(
            "source/ps_CONSCAN.txt",
            0,
            0,
            "PS_Main", // entry point function name
            "ps_2_0",
            D3DXSHADER_SKIPVALIDATION, //DEBUG, // | D3DXSHADER_SKIPOPTIMIZATION
            &PSBuffer,
            &errorBuffer,
            &PS1_PSCT);

        // output any error messages
        if (errorBuffer)
        {
            ::MessageBox(0, (char*)errorBuffer->GetBufferPointer(), 0, 0);
            Release<ID3DXBuffer*>(errorBuffer);
        }
        if (FAILED(hr))
        {
            ::MessageBox(0, "PS1--D3DXCompileShaderFromFile() - FAILED", 0, 0);
            return false;
        }

        // create pixel shader
        hr = Device->CreatePixelShader(
            (DWORD*)PSBuffer->GetBufferPointer(),
            &PS1_maddreduce);
        if (FAILED(hr))
        {
            ::MessageBox(0, "CreatePixelShader PS1 - FAILED", 0, 0);
            return false;
        }

        Release<ID3DXBuffer*>(PSBuffer);

        // *** PS2
        ID3DXBuffer* PS2Buffer = 0;
        ID3DXBuffer* errorBuffer2 = 0;

        hr = D3DXCompileShaderFromFile(
            "source/ps_16tapredux_2.txt",
            0,
            0,
            "PS_Main", // entry point function name
            "ps_2_0",
            D3DXSHADER_SKIPVALIDATION, //DEBUG, // | D3DXSHADER_SKIPOPTIMIZATION
            &PS2Buffer,
            &errorBuffer2,
            &PS2_PSCT);

        // output any error messages
        if (errorBuffer2)
        {
            ::MessageBox(0, (char*)errorBuffer2->GetBufferPointer(), 0, 0);
            Release<ID3DXBuffer*>(errorBuffer2);
        }
        if (FAILED(hr))
        {
            ::MessageBox(0, "PS2--D3DXCompileShaderFromFile() - FAILED", 0, 0);
            return false;
        }

        // create pixel shader
        hr = Device->CreatePixelShader(
            (DWORD*)PS2Buffer->GetBufferPointer(),
            &PS2_16tapreduce);
        if (FAILED(hr))
        {
            ::MessageBox(0, "CreatePixelShader PS2 - FAILED", 0, 0);
            return false;
        }

        Release<ID3DXBuffer*>(PS2Buffer);

        // *** VS1
        ID3DXBuffer* VSBuffer = 0;
        ID3DXBuffer* errorBuffer3 = 0;

        hr = D3DXCompileShaderFromFile(
            "source/vs_CONSCAN.txt",
            0,
            0,
            "Main", // entry point function name
            "vs_2_0",
            D3DXSHADER_SKIPVALIDATION, //DEBUG, // | D3DXSHADER_SKIPOPTIMIZATION
            &VSBuffer,
            &errorBuffer3,
            &VS1_VSCT);

        // output any error messages
        if (errorBuffer3)
        {
            ::MessageBox(0, (char*)errorBuffer3->GetBufferPointer(), 0, 0);
            Release<ID3DXBuffer*>(errorBuffer3);
        }
        if (FAILED(hr))
        {
            ::MessageBox(0, "VS1--D3DXCompileShaderFromFile() - FAILED", 0, 0);
            return false;
        }

        // create vertex shader
        hr = Device->CreateVertexShader(
            (DWORD*)VSBuffer->GetBufferPointer(),
            &VS1_maddreduce);
        if (FAILED(hr))
        {
            ::MessageBox(0, "CreateVertexShader VS1 - FAILED", 0, 0);
            return false;
        }

        Release<ID3DXBuffer*>(VSBuffer);

        // *** VS2
        ID3DXBuffer* VS2Buffer = 0;
        ID3DXBuffer* errorBuffer4 = 0;

        hr = D3DXCompileShaderFromFile(
            "source/vs_16tapredux_2.txt",
            0,
            0,
            "Main", // entry point function name
            "vs_2_0",
            D3DXSHADER_SKIPVALIDATION, //DEBUG, // | D3DXSHADER_SKIPOPTIMIZATION
            &VS2Buffer,
            &errorBuffer4,
            &VS2_VSCT);

        // output any error messages
        if (errorBuffer4)
        {
            ::MessageBox(0, (char*)errorBuffer4->GetBufferPointer(), 0, 0);
            Release<ID3DXBuffer*>(errorBuffer4);
        }
        if (FAILED(hr))
        {
            ::MessageBox(0, "VS2--D3DXCompileShaderFromFile() - FAILED", 0, 0);
            return false;
        }

        // create vertex shader
        hr = Device->CreateVertexShader(
            (DWORD*)VS2Buffer->GetBufferPointer(),
            &VS2_16tapreduce);
        if (FAILED(hr))
        {
            ::MessageBox(0, "CreateVertexShader VS2 - FAILED", 0, 0);
            return false;
        }

        Release<ID3DXBuffer*>(VS2Buffer);

        // get VS1 pixelsize constant handle
        VS1_PixelSizeHandle = VS1_VSCT->GetConstantByName(0, "PixelSize");
        VS1_DisplacementHandle = VS1_VSCT->GetConstantByName(0, "Displacement");
        VS1_AspectHandle = VS1_VSCT->GetConstantByName(0, "Aspect");

        // get PS2 and VS2 const handles
        VS2_offsetHandle = VS2_VSCT->GetConstantByName(0, "offset");
        PS2_offsetHandle = PS2_PSCT->GetConstantByName(0, "offset");

        //vertex decl and set stream source were here, moved to Setup
        return true;
    }// InitShaders()


    // InitReticlesAndScene_OneByOne()
    // loads reticle images into GPU, creates reticle and scene surfaces
    // and textures in GPU memory
    //
    // FOR TESTING PURPOSES ONLY
    // load the 100 reticle textures, [0..99] with sample data:
    // places whole value equal to texture index into
    // each pixel of the texture;

    bool InitReticlesAndScene_OneByOne() {
        // create scene texture and surface
        HRESULT hr = 0;
        hr = D3DXCreateTexture(
            Device,
            SceneSizeX, SceneSizeY, //was SceneSize for both
            1,                      // no mipmap chain
            D3DUSAGE_DYNAMIC,       // was 0 -- keep DYNAMIC!
            D3DFMT_R32F,
            D3DPOOL_DEFAULT,
            &Scene_Tex);
        if (FAILED(hr))
            return false;

        // get interface to top level surface of Scene_Tex
        hr = Scene_Tex->GetSurfaceLevel(0, &Scene_Surface);
        if (FAILED(hr))
            return false;

        //generate 100 reticle textures (half the scene width for CONSCAN)
        for (int t = 0; t < 100; t++) {
            hr = D3DXCreateTexture(
                Device,
                ReticleSize, ReticleSize, // was SceneSize/2
                1,                        // no mipmap chain
                0,                        // usage
                D3DFMT_R32F,
                D3DPOOL_DEFAULT,
                &Reticle_Tex[t]);
            if (FAILED(hr))
                return false;

            // get interface to top level surface of each tex
            hr = Reticle_Tex[t]->GetSurfaceLevel(0, &Reticle_Surface[t]);
            if (FAILED(hr))
                return false;
        }
        return true;
    }

    // InitRenderTargets_hybrid()

    bool InitRenderTargets_hybrid() {
        HRESULT hr = 0;
        int Init_RTSize;

        // set initial RT size
        // scene can be 1024, 512 or 256
        // reticle can be 512, 256 or 128
        if (ReticleSize == 128) { //was SceneSize == 256
            Init_RTSize = 512;    //scenesize/4 * 8
            ReduceIterations = 3;
        }
        else if (ReticleSize == 256) {
            Init_RTSize = 1024;
            ReduceIterations = 3;
        }
        else {
            Init_RTSize = 2048;
            ReduceIterations = 4;
        }

        //SET OutTexSize
        // the size of the final RT we will get our
        // result from
        // affects GetRTData()
        OutTexSize = 8*ReticleSize/(2*((int)pow(4, ReduceIterations))); //was 4*SceneSize
                                                                        //changed for CONSCAN
        OutPixels = OutTexSize*OutTexSize;

        //SET TexIndex
        // the array index of the RT_Reduce_Surface[] that will contain
        // the final result; affects GetRTData()
        TexIndex = ReduceIterations-1;

        // create initial RT (half the reticle palette size)
        hr = Device->CreateTexture(Init_RTSize, Init_RTSize, 1, D3DUSAGE_RENDERTARGET,
            D3DFMT_R32F, D3DPOOL_DEFAULT, &RT_Tex, 0); // D3D_FORMAT
        if (FAILED(hr))
            return false;

        // get interface to top level surface
        hr = RT_Tex->GetSurfaceLevel(0, &RT_Surface);
        if (FAILED(hr))
            return false;

        // create render targets for reduction op (2 or 3)
        int Size = Init_RTSize/4;
        for (int i = 0; i < ReduceIterations; i++) {
            hr = D3DXCreateTexture(
                Device,
                Size, Size,
                1,                     // no mipmap chain
                D3DUSAGE_RENDERTARGET, //could be DYNAMIC, but not DYNAMIC and RT
                D3DFMT_R32F,           // D3D_FORMAT,
                D3DPOOL_DEFAULT,
                &RT_Reduce_Tex[i]);
            if (FAILED(hr))
                return false;

            // get interface to top level surface
            hr = RT_Reduce_Tex[i]->GetSurfaceLevel(0, &RT_Reduce_Surface[i]);
            if (FAILED(hr))
                return false;

            Size /= 4;
        }

        // create SYSTEMMEM tex to send result to
        hr = D3DXCreateTexture(
            Device,
            OutTexSize, OutTexSize,
            1,                // no mipmap chain
            D3DUSAGE_DYNAMIC, // usage could be DYNAMIC, but not DYNAMIC and RT
            D3DFMT_R32F,      // D3D_FORMAT,
            D3DPOOL_SYSTEMMEM,
            &Ones_Tex);
        if (FAILED(hr))
            return false;

        // get interface to top level surface of Ones_Tex[]
        hr = Ones_Tex->GetSurfaceLevel(0, &Ones_Surface);
        if (FAILED(hr))
            return false;

        // *** OFFSET ARRAYS FOR 16:1 REDUCE

        // PixelSize of input texture to first reduce op
        float PixelSize2 = 1/(float)Init_RTSize;

        // calculate displacements
        for (int k = 0; k < ReduceIterations; k++) {
            for (int i = 0; i < 4; i++) {
                for (int j = 0; j < 4; j++) {
                    offset[k][i*4+j] = D3DXVECTOR2(PixelSize2*(float)j, PixelSize2*(float)i);
                }
            }
            PixelSize2 *= 4.0f;
        }

        return true;
    }// InitRenderTargets_hybrid()
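
The size arithmetic in InitRenderTargets_hybrid() is easier to follow with concrete numbers. The sketch below reproduces only the integer bookkeeping (first render-target edge, number of 16:1 passes, the edge of each reduction target, and OutTexSize) so the three supported reticle sizes can be checked by hand. ReducePlan and PlanReduction are hypothetical names, an integer loop replaces the pow() call, and the assertion limiting the reticle size to 128, 256 or 512 is an assumption taken from the comment in the listing.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of the sizes chosen by InitRenderTargets_hybrid().
struct ReducePlan {
    int initRTSize;              // edge of the first render target (RT_Tex)
    int iterations;              // number of 16:1 reduction passes
    std::vector<int> passSizes;  // edge of each RT_Reduce_Tex[i]
    int outTexSize;              // edge of the system-memory result texture
};

ReducePlan PlanReduction(int reticleSize) {
    // Assumption: only the sizes named in the listing's comment are used.
    assert(reticleSize == 128 || reticleSize == 256 || reticleSize == 512);

    ReducePlan plan;
    if (reticleSize == 128)      { plan.initRTSize = 512;  plan.iterations = 3; }
    else if (reticleSize == 256) { plan.initRTSize = 1024; plan.iterations = 3; }
    else                         { plan.initRTSize = 2048; plan.iterations = 4; }

    int size = plan.initRTSize / 4;  // each pass shrinks the edge by 4
    int pow4 = 1;                    // accumulates 4^iterations
    for (int i = 0; i < plan.iterations; ++i) {
        plan.passSizes.push_back(size);
        size /= 4;
        pow4 *= 4;
    }
    // Same expression as the listing, with pow(4, n) replaced by pow4.
    plan.outTexSize = 8 * reticleSize / (2 * pow4);
    return plan;
}
```

For a 128-pixel reticle this gives reduction targets of 128, 32 and 8 texels per edge, so the last target and the result texture are both 8 by 8, and each 64-pixel tile written by MAddReduce() collapses to one texel.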


    // Setup()
    // Initializes geometry, renderstate, calls
    // InitRenderTargets, InitShaders, InitReticlesAndScene

    bool Setup() {
        HRESULT hr = 0;

        // DISABLE unneeded processing

        // turn off Stencil and Culling
        hr = Device->SetDepthStencilSurface(0);
        if (FAILED(hr))
            return false;

        hr = Device->SetRenderState(D3DRS_CULLMODE, D3DCULL_NONE);
        if (FAILED(hr))
            return false;

        // disable lighting
        Device->SetRenderState(D3DRS_LIGHTING, false);

        // create geometry
        D3DVERTEXELEMENT9 decl[] =
        {
            {0, 0, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0},
            D3DDECL_END()
        };

        // declare the vertex structure
        hr = Device->CreateVertexDeclaration(decl, &m_pDecl);
        if (FAILED(hr))
            return false;

        // create VB with only x,y position
        hr = Device->CreateVertexBuffer(56 * sizeof(CUSTOMVERTEX), // was 4
            D3DUSAGE_WRITEONLY,
            0,
            D3DPOOL_DEFAULT,
            &QuadVB,
            NULL);
        if (FAILED(hr))
            return false;


        float left     = -1.00f;
        float right    =  1.00f;
        float top      =  1.00f;
        float bottom   = -1.00f;
        float top_row2 =  0.75f;
        float top_row3 =  0.50f;
        float top_row4 =  0.25f;
        float top_row5 =  0.00f;
        float top_row7 = -0.50f;
        float top_row8 = -0.75f;
        float bot_row1 =  0.75f;
        float bot_row2 =  0.50f;
        float bot_row4 =  0.00f;
        float bot_row5 = -0.25f;
        float bot_row6 = -0.50f;
        float bot_row7 = -0.75f;


        CUSTOMVERTEX* v;
        QuadVB->Lock(0, 56 * sizeof(CUSTOMVERTEX), (VOID**)&v, 0); // was 4

        // quad 0 full square
        // left bottom
        v[0].x = left;
        v[0].y = bottom; //bottom
        // left top
        v[1].x = left;
        v[1].y = top;
        // right bottom
        v[2].x = right;
        v[2].y = bottom; //bottom
        // right top
        v[3].x = right;
        v[3].y = top;

        //quad 1 r1-5
        // left bottom
        v[4].x = left;
        v[4].y = bot_row5; //5
        // left top
        v[5].x = left;
        v[5].y = top;
        // right bottom
        v[6].x = right;
        v[6].y = bot_row5; //5
        // right top
        v[7].x = right;
        v[7].y = top;

        //quad 2 r2-6
        // left bottom
        v[8].x = left;
        v[8].y = bot_row6;
        // left top
        v[9].x = left;
        v[9].y = top_row2;
        // right bottom
        v[10].x = right;
        v[10].y = bot_row6;
        // right top
        v[11].x = right;
        v[11].y = top_row2;

        // quad 3 r3-7
        // left bottom
        v[12].x = left;
        v[12].y = bot_row7;
        // left top
        v[13].x = left;
        v[13].y = top_row3;
        // right bottom
        v[14].x = right;
        v[14].y = bot_row7;
        // right top
        v[15].x = right;
        v[15].y = top_row3;

        //quad 4 r4-8
        // left bottom
        v[16].x = left;
        v[16].y = bottom;
        // left top
        v[17].x = left;
        v[17].y = top_row4;
        // right bottom
        v[18].x = right;
        v[18].y = bottom;
        // right top
        v[19].x = right;
        v[19].y = top_row4;

        // quad 5 r1-6
        // left bottom
        v[20].x = left;
        v[20].y = bot_row6;
        // left top
        v[21].x = left;
        v[21].y = top;
        // right bottom
        v[22].x = right;
        v[22].y = bot_row6;
        // right top
        v[23].x = right;
        v[23].y = top;

        //quad 6 r2-7
        // left bottom
        v[24].x = left;
        v[24].y = bot_row7;
        // left top
        v[25].x = left;
        v[25].y = top_row2;
        // right bottom
        v[26].x = right;
        v[26].y = bot_row7;
        // right top
        v[27].x = right;
        v[27].y = top_row2;

        //quad 7 r3-8
        // left bottom
        v[28].x = left;
        v[28].y = bottom;
        // left top
        v[29].x = left;
        v[29].y = top_row3;
        // right bottom
        v[30].x = right;
        v[30].y = bottom;
        // right top
        v[31].x = right;
        v[31].y = top_row3;

        //quad 8 T1
        // left bottom
        v[32].x = left;
        v[32].y = bot_row1;
        // left top
        v[33].x = left;
        v[33].y = top;
        // right bottom
        v[34].x = right;
        v[34].y = bot_row1;
        // right top
        v[35].x = right;
        v[35].y = top;

        //quad 9 B2
        // left bottom
        v[36].x = left;
        v[36].y = bottom;
        // left top
        v[37].x = left;
        v[37].y = top_row7;
        // right bottom
        v[38].x = right;
        v[38].y = bottom;
        // right top
        v[39].x = right;
        v[39].y = top_row7;

        //quad 10 T4
        // left bottom
        v[40].x = left;
        v[40].y = bot_row4;
        // left top
        v[41].x = left;
        v[41].y = top;
        // right bottom
        v[42].x = right;
        v[42].y = bot_row4;
        // right top
        v[43].x = right;
        v[43].y = top;

        // quad 11 B1
        // left bottom
        v[44].x = left;
        v[44].y = bottom;
        // left top
        v[45].x = left;
        v[45].y = top_row8;
        // right bottom
        v[46].x = right;
        v[46].y = bottom;
        // right top
        v[47].x = right;
        v[47].y = top_row8;

        //quad 12 B4
        // left bottom
        v[48].x = left;
        v[48].y = bottom;
        // left top
        v[49].x = left;
        v[49].y = top_row5;
        // right bottom
        v[50].x = right;
        v[50].y = bottom;
        // right top
        v[51].x = right;
        v[51].y = top_row5;

        //quad 13 T2
        // left bottom
        v[52].x = left;
        v[52].y = bot_row2;
        // left top
        v[53].x = left;
        v[53].y = top;
        // right bottom
        v[54].x = right;
        v[54].y = bot_row2;
        // right top
        v[55].x = right;
        v[55].y = top;

        QuadVB->Unlock();

        // set vertex declaration
        Device->SetVertexDeclaration(m_pDecl);

        // set geometry
        Device->SetStreamSource(0, QuadVB, 0, sizeof(CUSTOMVERTEX));

        // Device->SetSamplerState(1, D3DSAMP_ADDRESSU, D3DTADDRESS_CLAMP);
        // Device->SetSamplerState(1, D3DSAMP_ADDRESSV, D3DTADDRESS_CLAMP);

        return true;
    }// Setup()
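
Every quad in Setup() spans the full width of clip space and differs only in which of eight horizontal rows it covers, each row being 0.25 units tall. The sketch below shows how those row edges follow from the row number, as a cross-check on the hand-written constants. RowTop, RowBottom, MakeRowQuad and Vertex2 are hypothetical names, and the assumption is that rows are numbered 1 (top) to 8 (bottom) as the quad comments suggest.

```cpp
#include <cassert>

// Hypothetical stand-in for CUSTOMVERTEX: x,y position only.
struct Vertex2 { float x, y; };

// Top edge of row 1..8 in clip space (row 1 starts at y = +1).
float RowTop(int row) {
    assert(row >= 1 && row <= 8);
    return 1.0f - 0.25f * static_cast<float>(row - 1);
}

// Bottom edge of row 1..8 in clip space (row 8 ends at y = -1).
float RowBottom(int row) {
    assert(row >= 1 && row <= 8);
    return 1.0f - 0.25f * static_cast<float>(row);
}

// Fills four vertices covering rows firstRow..lastRow, in the strip order
// used by the listing: left bottom, left top, right bottom, right top.
void MakeRowQuad(int firstRow, int lastRow, Vertex2 out[4]) {
    assert(firstRow <= lastRow);
    const float top = RowTop(firstRow);
    const float bottom = RowBottom(lastRow);
    out[0].x = -1.0f; out[0].y = bottom;
    out[1].x = -1.0f; out[1].y = top;
    out[2].x =  1.0f; out[2].y = bottom;
    out[3].x =  1.0f; out[3].y = top;
}
```

Calling MakeRowQuad(2, 6, q) yields the same y values as quad 2 in the listing (top_row2 and bot_row6).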


private:

    // LOAD INPUT SCENE

    bool LoadInputScene(float p_inputArray[]) {
        RECT SurfRect;
        SurfRect.left = 0;
        SurfRect.top = 0;
        SurfRect.right = SceneSizeX;
        SurfRect.bottom = SceneSizeY;

        HRESULT hr = 0;
        hr = D3DXLoadSurfaceFromMemory(
            Scene_Surface,
            0,
            0,
            p_inputArray,
            D3DFMT_R32F,
            (4*SceneSizeX), // 16 for 4x32-bit, 8 for 2x32-bit, 4 for 1x32F format
            0,
            &SurfRect,
            D3DX_FILTER_NONE,
            0);
        if (FAILED(hr))
            return false;
        return true;
    }// LoadInputScene()


    // bool MAddReduce()
    //
    bool MAddReduce(int* p_retIndex, int* p_xdisp, int* p_ydisp) {
        HRESULT hr = 0;

        // ---- set PS and VS shaders
        Device->SetVertexShader(VS1_maddreduce);
        Device->SetPixelShader(PS1_maddreduce);

        // set VS PixelSize const
        // .x = 1/reticle img width, .y = 1/scenewidthX, .z = 1/scenewidthY, .w = 0.0f
        hr = VS1_VSCT->SetVector(Device, VS1_PixelSizeHandle,
            &D3DXVECTOR4(1.0f/(float)ReticleSize, 1.0f/(float)SceneSizeX,
                         1.0f/(float)SceneSizeY, 0.0f));
        if (FAILED(hr))
            return false;
        //was 2/SceneSize and 1/SceneSize

        // set Aspect in VS; allows for arbitrary scene dimensions
        D3DXVECTOR2 aspect;
        aspect.x = (float)ReticleSize/(float)SceneSizeX; //0.5f;
        aspect.y = (float)ReticleSize/(float)SceneSizeY; //0.5f;
        hr = VS1_VSCT->SetFloatArray(Device,
            VS1_AspectHandle,
            (float*)aspect, 2);
        if (FAILED(hr))
            return false;

        // ---- set RT
        hr = Device->SetRenderTarget(
            0,
            RT_Surface);
        if (FAILED(hr))
            return false;

        // Device->Clear(0, 0, D3DCLEAR_TARGET, 0L, 0, 0);
        Device->SetStreamSource(0, QuadVB, 0, sizeof(CUSTOMVERTEX));

        D3DVIEWPORT9 vp;
        vp.Width = ReticleSize/2;
        vp.Height = ReticleSize/2;
        vp.MinZ = 0.0f;
        vp.MaxZ = 1.0f;

        Device->SetTexture(1, Scene_Tex); // stage 1 = input scene

        D3DXVECTOR2 displacement;
        float f_dispx;
        float f_dispy;
        float halfRetX;
        float halfRetY;
        int index = 0;

        for (int v = 0; v < 5; v++) {
            for (int h = 0; h < 8; h++) {
                vp.X = h*ReticleSize/2;
                vp.Y = v*ReticleSize/2;
                Device->SetViewport(&vp);

                // set sampler 0 with reticle image for maddredux with scene
                index = v*8+h;
                Device->SetTexture(0, Reticle_Tex[p_retIndex[index]]);

                //new displacement vector added for conscan x/y offset in VS
                f_dispx = (float)p_xdisp[index]/(float)SceneSizeX;
                f_dispy = (float)p_ydisp[index]/(float)SceneSizeY;
                halfRetX = (float)(ReticleSize/2)/(float)SceneSizeX;
                halfRetY = (float)(ReticleSize/2)/(float)SceneSizeY;
                displacement.x = 0.5f + f_dispx - halfRetX + 1.0f/(2.0f*(float)SceneSizeX);
                displacement.y = 0.5f - f_dispy - halfRetY + 1.0f/(2.0f*(float)SceneSizeY);
                hr = VS1_VSCT->SetFloatArray(Device,
                    VS1_DisplacementHandle,
                    (float*)displacement,
                    2);
                if (FAILED(hr))
                    return false;

                // render -- madd scene with a single reticle image
                Device->BeginScene();
                Device->DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2);
                Device->EndScene();
            }
        }

        // set streamsource to 8x5 rectangle r1-r5, Quad 1
        Device->SetStreamSource(0, QuadVB, 4*sizeof(CUSTOMVERTEX), sizeof(CUSTOMVERTEX));
        return true;
    } // MAddReduce()

//------------------------------------------------------------
// Redux()
// 16:1 REDUCE OPERATION
//------------------------------------------------------------
bool Redux() {
    HRESULT hr = 0;

    // ---- set PS and VS shaders
    Device->SetVertexShader(VS2_16tapreduce);
    Device->SetPixelShader(PS2_16tapreduce);

    // initial source tex is result of maddreduce op
    Device->SetTexture(0, RT_Tex);

    for (int i = 0; i < ReduceIterations; i++) {
        hr = Device->SetRenderTarget(0, RT_Reduce_Surface[i]);
        if (FAILED(hr))
            return false;

        if (i > 0)
            Device->SetTexture(0, RT_Reduce_Tex[i-1]);

        // set VS offset constant array
        hr = VS2_VSCT->SetFloatArray(Device,
                                     VS2_offsetHandle,
                                     (float*)&offset[i][0],
                                     16); // 2*8 floats
        if (FAILED(hr))
            return false;

        // set PS offset constant array
        hr = PS2_PSCT->SetFloatArray(Device,
                                     PS2_offsetHandle,
                                     (float*)&offset[i][8],
                                     16); // 2*8 floats
        if (FAILED(hr))
            return false;

        // render -- do 16:1 reduction
        // Device->Clear(0, 0, D3DCLEAR_TARGET, 0L, 0, 0);
        Device->BeginScene();
        Device->DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2);
        Device->EndScene();
    }

    return true;
} // Redux()


// GetRTData_hybrid()
// retrieve FINAL data from lockable render-to surface

void GetRTData_hybrid(double p_b[]) {
    HRESULT hr = 0;
    int q;
    int col;
    int row;

    D3DLOCKED_RECT lockedRect;

    Device->GetRenderTargetData(RT_Reduce_Surface[TexIndex], Ones_Surface);

    Ones_Surface->LockRect(&lockedRect,
                           0,                 // lock entire tex
                           D3DLOCK_READONLY); // flags

    float* imageData = (float*)lockedRect.pBits;

    // perform final 4:1 reduction if necessary and add 4 components of each pixel
    if (OutTexSize > 8) {
        for (int i = 0; i < 40; i++) {
            row = i / 8;
            col = i % 8;
            q = row*32 + col*2;
            p_b[i] = imageData[q] + imageData[q+1] + imageData[q+16] + imageData[q+17];
        }
    }
    else {
        for (int i = 0; i < 40; i++) {
            p_b[i] = imageData[i];
        }
    }

    Ones_Surface->UnlockRect();
    return;
} // GetRTData_hybrid()


// Release() and Delete()
// cleanup functions

template<class T> void Release(T t) {
    if (t) {
        t->Release();
        t = 0;
    }
}

template<class T> void Delete(T t) {
    if (t) {
        delete t;
        t = 0;
    }
}


// Cleanup()
// releases textures/surfaces/interfaces/devices/memory
// allocated during program

void Cleanup() {
    // vertex buffer and declaration
    Release<IDirect3DVertexBuffer9*>(QuadVB);
    Release<LPDIRECT3DVERTEXDECLARATION9>(m_pDecl);

    // textures and surfaces
    Release<IDirect3DSurface9*>(Scene_Surface);
    Release<IDirect3DTexture9*>(Scene_Tex);

    int n = 100;
    for (int t = 0; t < n; t++) {
        Release<IDirect3DSurface9*>(Reticle_Surface[t]);
        Release<IDirect3DTexture9*>(Reticle_Tex[t]);
    }

    Release<IDirect3DTexture9*>(RT_Tex);
    Release<IDirect3DSurface9*>(RT_Surface);

    for (int t = 0; t < ReduceIterations; t++) {
        Release<IDirect3DSurface9*>(RT_Reduce_Surface[t]);
        Release<IDirect3DTexture9*>(RT_Reduce_Tex[t]);
    }

    Release<IDirect3DTexture9*>(Ones_Tex);
    Release<IDirect3DSurface9*>(Ones_Surface);

    // PS & VS
    Release<IDirect3DPixelShader9*>(PS1_maddreduce);
    Release<ID3DXConstantTable*>(PS1_PSCT);

    Release<IDirect3DVertexShader9*>(VS1_maddreduce);
    Release<ID3DXConstantTable*>(VS1_VSCT);

    Release<IDirect3DPixelShader9*>(PS2_16tapreduce);
    Release<ID3DXConstantTable*>(PS2_PSCT);

    Release<IDirect3DVertexShader9*>(VS2_16tapreduce);
    Release<ID3DXConstantTable*>(VS2_VSCT);

    Device->Release();
} // Cleanup()

public: 


// uploadReticle

bool uploadReticle(int p_index, float p_array[]) {
    HRESULT hr = 0;

    RECT srcRect;
    srcRect.top    = 0;
    srcRect.bottom = ReticleSize; // 12 for CONSCAN
    srcRect.left   = 0;
    srcRect.right  = ReticleSize; // change from SceneSize/2 to ReticleSize

    hr = D3DXLoadSurfaceFromMemory(
        Reticle_Surface[p_index],
        0,
        0,
        p_array,
        D3DFMT_R32F,
        (4*ReticleSize), // 16 for 4x32-bit, 8 for 2x32-bit, 4 for 1x32F format
        0,               // SceneSize/2 for CONSCAN
        &srcRect,
        D3DX_FILTER_NONE,
        0);

    if (FAILED(hr))
        return false;

    return true;
}


// Process()
// user interface to GPU algorithm
// input: reference to scene image array variable -- scene[SceneSizeX*SceneSizeY]
// input: references to arrays in calling program:
//        reticleIndex[40], xdisp[40], ydisp[40], resultArray[40]
// output: double resultArray[40] -- output to JMASS

void Process(int p_retIndex[], float p_SceneArray[], double p_resultArray[],
             int p_xdisp[], int p_ydisp[]) {
    LoadInputScene(p_SceneArray);
    MAddReduce(p_retIndex, p_xdisp, p_ydisp);
    Redux();
    GetRTData_hybrid(p_resultArray);
    return;
} // Process()

int GetAlg() {
    return 3; // one by one, R32F, CONSCAN
}

// ~Gpu() DESTRUCTOR

~GpuConscan() {
    Cleanup();
} // ~Gpu() DESTRUCTOR

};

#endif // GPU_CLASS_H_BY_MAJ_JEFFERS
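
The final 4:1 step in GetRTData_hybrid() is plain index arithmetic on the locked surface, so it can be checked on the CPU without a Direct3D device. The sketch below assumes, as the constants 32, 16 and 17 in the listing imply, a read-back surface 16 floats wide that holds an 8x5 grid of 2x2 blocks. The names finalReduce4to1, kSurfaceWidth, kResultCols and kResultRows are illustrative and do not appear in the original class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: the read-back surface is 16 floats wide (two texels per
// result column, eight result columns), tightly packed with no row padding.
const int kSurfaceWidth = 16;
const int kResultCols   = 8;
const int kResultRows   = 5;

// Sum each 2x2 block of the surface into one of the 40 results.
// Hypothetical helper mirroring the OutTexSize > 8 branch of GetRTData_hybrid().
std::vector<double> finalReduce4to1(const std::vector<float>& imageData) {
    // Ten surface rows (5 result rows x 2 texel rows) must be present.
    assert(imageData.size() >=
           static_cast<std::size_t>(kSurfaceWidth * kResultRows * 2));

    std::vector<double> result(kResultCols * kResultRows);
    for (int i = 0; i < kResultCols * kResultRows; ++i) {
        const int row = i / kResultCols;
        const int col = i % kResultCols;
        // Index of the top-left texel of this result's 2x2 block.
        const int q = row * 2 * kSurfaceWidth + col * 2;
        result[i] = static_cast<double>(imageData[q])
                  + imageData[q + 1]
                  + imageData[q + kSurfaceWidth]
                  + imageData[q + kSurfaceWidth + 1];
    }
    return result;
}
```

A real locked surface reports its row stride in D3DLOCKED_RECT::Pitch, which the listing does not consult; the sketch makes the same tight-packing assumption explicit rather than resolving it.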
//------------------------------------------------------------
// file: vs_CONSCAN.txt
// BY:   SEAN JEFFERS
//       (adapted from vs_experimental)
//
// ... 04     multiplies 1x1 scene by 8x8 reticle palette, then does
//            4:1 redux; results in RT that is quarter sized of palette;
//            sampling of scene done w/ wrapping
//            PixelSize.x = 1/ret palette width
//            PixelSize.y = 1/scene width
//            PixelSize.z = 0.0f (must!)
// 2 jan 05   VS generates 8 texcoords for PS
//            scene is 4x RT width; ret is 2x RT width
//            implements CONSCAN
//            PixelSize.w = x-displacement texcoord wrt scene
//                        = x/scene width, where x is pixel displacement {0..scene_width-1}
//            PixelSize.x = 1/ret width
//            PixelSize.y = 1/scene width
// 5 jan 05   modified to have separate displacement and aspect consts
//------------------------------------------------------------

uniform float4 PixelSize;
uniform float2 Displacement;
uniform float2 Aspect;

// structures

struct VS_INPUT
{
    float4 Pos : POSITION;
};

struct VS_OUTPUT
{
    float4 Pos  : POSITION;
    float2 Tex  : TEXCOORD0;
    float2 Tex1 : TEXCOORD1;
    float2 Tex2 : TEXCOORD2;
    float2 Tex3 : TEXCOORD3;
    float2 Tex4 : TEXCOORD4;
    float2 Tex5 : TEXCOORD5;
    float2 Tex6 : TEXCOORD6;
    float2 Tex7 : TEXCOORD7;
};

// vertex shader function (input channels)

VS_OUTPUT Main(VS_INPUT input)
{
    VS_OUTPUT output = (VS_OUTPUT)0;

    output.Pos.xy = input.Pos.xy; // + PixelSize.xy;
    output.Pos.z  = 0.5f;
    output.Pos.w  = 1.0f;

    // reticle tex coords (ret width = 2x RT width)
    output.Tex  = float2(0.5f, -0.5f) * input.Pos.xy + 0.5f.xx;
    output.Tex1 = output.Tex + PixelSize.xw;
    output.Tex2 = output.Tex + PixelSize.wx;
    output.Tex3 = output.Tex + PixelSize.xx;

    // scene tex coords (scene width = 4x RT width)
    output.Tex4 = Aspect*output.Tex + Displacement; // was 0.5f*; adds in x-disp
    output.Tex5 = output.Tex4 + PixelSize.yw;
    output.Tex6 = output.Tex4 + PixelSize.wz;       // y now SceneSizeX
    output.Tex7 = output.Tex4 + PixelSize.yz;       // changed z to w, z now SceneSizeY

    return output;
}
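
The first texcoord computed by Main() maps the quad's clip-space corners into texture space with the y axis flipped. The mapping is sketched below on the CPU, in C++ for consistency with the host code. The Vec2 struct and the clipToTexCoord name are illustrative and are not part of the shader.

```cpp
struct Vec2 {
    float x;
    float y;
};

// Mirrors: output.Tex = float2(0.5f, -0.5f) * input.Pos.xy + 0.5f.xx;
// Clip space spans [-1, 1] with +y up; texture space spans [0, 1] with +v down.
Vec2 clipToTexCoord(Vec2 pos) {
    Vec2 uv;
    uv.x =  0.5f * pos.x + 0.5f;
    uv.y = -0.5f * pos.y + 0.5f;
    return uv;
}
```

With this mapping the top-left clip corner (-1, 1) lands on texcoord (0, 0) and the bottom-right corner (1, -1) on (1, 1); the remaining seven texcoords in the shader are fixed offsets from this base value.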
//------------------------------------------------------------
// file: ps_CONSCAN.txt
//
// depends on: file vs_onebyone.txt, GPU_CLASS_ONEBYONE_R32F.h
// By: Maj Sean Jeffers
// descr: modified version of ps_maddreduce_new for CONSCAN
//        PS for mult, add, reduce, 4:1; one-by-one approach using R32F textures only
//
// 27 dec 04 -- use with GPU_CLASS_ONEBYONE_R32F.h for CONSCAN
//              does maddreduce op with input tex's R32F, output tex R32F
//           -- "dot" approach seems to work a little faster than other
//              commented out approach; but both work
//           -- eliminated "noise" caused by dot product by assigning tex samples
//              to individual float vector components vs. full float4
// 2 jan 05  -- modified to do conscan approach; accept 8 texcoords, all R32F tex's
//------------------------------------------------------------
// globals
//------------------------------------------------------------

sampler Render_sampler;  // reticle img (2 x RT width)
sampler Render_sampler2; // scene img (4 x RT width)

//------------------------------------------------------------
// structures
//------------------------------------------------------------
struct PS_INPUT
{
    float2 Tex0 : TEXCOORD0;
    float2 Tex1 : TEXCOORD1;
    float2 Tex2 : TEXCOORD2;
    float2 Tex3 : TEXCOORD3;
    float2 Tex4 : TEXCOORD4;
    float2 Tex5 : TEXCOORD5;
    float2 Tex6 : TEXCOORD6;
    float2 Tex7 : TEXCOORD7;
};

// struct PS_OUTPUT
// {
//     float4 clr : COLOR; // was COLOR0
// };

//------------------------------------------------------------
// Pixel Shader (input channels): output channel
//------------------------------------------------------------
float4 PSMain(PS_INPUT input) : COLOR
{
    float4 t1;
    t1.r = tex2D(Render_sampler, input.Tex0);
    t1.g = tex2D(Render_sampler, input.Tex1);
    t1.b = tex2D(Render_sampler, input.Tex2);
    t1.a = tex2D(Render_sampler, input.Tex3);

    float4 t2;
    t2.r = tex2D(Render_sampler2, input.Tex4);
    t2.g = tex2D(Render_sampler2, input.Tex5);
    t2.b = tex2D(Render_sampler2, input.Tex6);
    t2.a = tex2D(Render_sampler2, input.Tex7);

    // madd ret (t1) & scene (t2)
    return dot(t1, t2);
}
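
Taken together with the later reduction passes, PSMain() appears to amount to a sum of element-wise products between one reticle image and an equally sized window of the scene. A CPU reference of that sum, useful for spot-checking GPU results, might look like the sketch below. It assumes row-major float images and a window that lies fully inside the scene, so it does not model the wrapped sampling mentioned in the vertex shader header. The function name maddReference and the window origin parameters x0 and y0 are hypothetical; how that origin relates to the Displacement constant is not reproduced here.

```cpp
#include <cstddef>
#include <vector>

// Sum of element-wise products between a square reticle and the equally
// sized scene window whose top-left texel is (x0, y0).
// Assumption: row-major storage, window fully inside the scene (no wrapping).
double maddReference(const std::vector<float>& reticle, int reticleSize,
                     const std::vector<float>& scene, int sceneWidth,
                     int x0, int y0) {
    double sum = 0.0;
    for (int y = 0; y < reticleSize; ++y) {
        for (int x = 0; x < reticleSize; ++x) {
            const float r =
                reticle[static_cast<std::size_t>(y) * reticleSize + x];
            const float s =
                scene[static_cast<std::size_t>(y0 + y) * sceneWidth + (x0 + x)];
            sum += static_cast<double>(r) * s;
        }
    }
    return sum;
}
```

Accumulating in double mirrors the double result array returned by Process(); the GPU path accumulates in 32-bit floats, so small differences between the two are to be expected.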
#ifndef GPU_UTILITY_BY_MAJ_JEFFERS
#define GPU_UTILITY_BY_MAJ_JEFFERS

namespace d3d {

LRESULT CALLBACK Gpu_WndProc(HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam) {
    return DefWindowProc(hWnd, message, wParam, lParam);
}

ATOM Gpu_WndClass(HINSTANCE hInstance) {

    WNDCLASSEX wcex;
    wcex.cbSize        = sizeof(WNDCLASSEX);
    wcex.style         = CS_HREDRAW | CS_VREDRAW;
    wcex.lpfnWndProc   = (WNDPROC)d3d::Gpu_WndProc; // (WNDPROC)
    wcex.cbClsExtra    = 0;
    wcex.cbWndExtra    = 0;
    wcex.hInstance     = hInstance;
    wcex.hIcon         = LoadIcon(hInstance, (LPCTSTR)IDI_WINCONSCAN);
    wcex.hCursor       = LoadCursor(NULL, IDC_ARROW);
    wcex.hbrBackground = (HBRUSH)(COLOR_WINDOW+1);
    wcex.lpszMenuName  = 0; // no menu
    wcex.lpszClassName = "Gpu_WndClass";
    wcex.hIconSm       = LoadIcon(wcex.hInstance, (LPCTSTR)IDI_SMALL);

    return RegisterClassEx(&wcex);
}


bool InitD3D(HINSTANCE hInstance, int nCmdShow,
             int width, int height,
             bool windowed,
             D3DDEVTYPE deviceType,
             IDirect3DDevice9** device)
{
    // create GPU window
    d3d::Gpu_WndClass(hInstance);

    HWND hWnd2 = CreateWindow("Gpu_WndClass", "GPU", WS_OVERLAPPEDWINDOW,
        CW_USEDEFAULT, 0, CW_USEDEFAULT, 0, NULL, NULL, hInstance, NULL);

    if (!hWnd2) {
        return FALSE;
    }

    // ShowWindow(hWnd2, nCmdShow);
    // UpdateWindow(hWnd2);

    //
    // InitD3D:
    //
    HRESULT hr = 0;

    // Step 1: Create the IDirect3D9 object.
    IDirect3D9* d3d9 = 0;
    d3d9 = Direct3DCreate9(D3D_SDK_VERSION);

    if (!d3d9)
    {
        ::MessageBox(0, "Direct3DCreate9() - FAILED", 0, 0);
        return false;
    }

    // Step 2: Check for hardware vp.
    D3DCAPS9 caps;
    d3d9->GetDeviceCaps(D3DADAPTER_DEFAULT, deviceType, &caps);

    int vp = 0;
    if (caps.DevCaps & D3DDEVCAPS_HWTRANSFORMANDLIGHT)
        vp = D3DCREATE_HARDWARE_VERTEXPROCESSING; // SOFTWARE if debug
    else
        vp = D3DCREATE_SOFTWARE_VERTEXPROCESSING;

    // Step 3: Fill out the D3DPRESENT_PARAMETERS structure.
    D3DPRESENT_PARAMETERS d3dpp;
    d3dpp.BackBufferWidth            = width;
    d3dpp.BackBufferHeight           = height;
    d3dpp.BackBufferFormat           = D3DFMT_X8R8G8B8;
    d3dpp.BackBufferCount            = 1;
    d3dpp.MultiSampleType            = D3DMULTISAMPLE_NONE;
    d3dpp.MultiSampleQuality         = 0;
    d3dpp.SwapEffect                 = D3DSWAPEFFECT_DISCARD;
    d3dpp.hDeviceWindow              = hWnd2;
    d3dpp.Windowed                   = windowed;
    d3dpp.EnableAutoDepthStencil     = false;
    d3dpp.AutoDepthStencilFormat     = D3DFMT_D24S8;
    d3dpp.Flags                      = 0;
    d3dpp.FullScreen_RefreshRateInHz = D3DPRESENT_RATE_DEFAULT;
    d3dpp.PresentationInterval       = D3DPRESENT_INTERVAL_IMMEDIATE;

    // Step 4: Create the device.
    hr = d3d9->CreateDevice(
        D3DADAPTER_DEFAULT, // primary adapter
        deviceType,         // device type
        hWnd2,              // window associated with device
        vp,                 // | D3DCREATE_FPU_PRESERVE, // vertex processing
        &d3dpp,             // present parameters
        device);            // return created device

    if (FAILED(hr))
    {
        // try again using a 16-bit depth buffer
        d3dpp.AutoDepthStencilFormat = D3DFMT_D16;

        hr = d3d9->CreateDevice(
            D3DADAPTER_DEFAULT,
            deviceType,
            hWnd2,
            vp,
            &d3dpp,
            device);

        if (FAILED(hr))
        {
            d3d9->Release(); // done with d3d9 object
            ::MessageBox(0, "CreateDevice() - FAILED", 0, 0);
            return false;
        }
    }

    d3d9->Release(); // done with d3d9 object
    return true;
}

} // namespace d3d

#endif // GPU_UTILITY_BY_MAJ_JEFFERS
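
Both this utility and the class's Cleanup() release COM interfaces by hand. The Release<T>(T t) helper in the class listing receives its pointer by value, so its `t = 0` clears only the local copy and the caller's member still holds the released address. A common alternative takes the pointer by reference; it is sketched here with a stand-in type rather than a real Direct3D interface. FakeComObject and SafeRelease are illustrative names, not part of the original code.

```cpp
#include <cstddef>

// Stand-in for a COM interface; real code would use IUnknown-derived types.
struct FakeComObject {
    int refCount;
    FakeComObject() : refCount(1) {}
    unsigned long Release() {
        return static_cast<unsigned long>(--refCount);
    }
};

// Release the interface and null the caller's pointer.
// Taking the pointer by reference is what lets the assignment reach the caller.
template <class T>
void SafeRelease(T*& p) {
    if (p) {
        p->Release();
        p = NULL;
    }
}
```

Because the caller's pointer is nulled, a second SafeRelease call on the same variable is a no-op, which guards against a double release if cleanup runs twice.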